Abstract

The density matrices are positively semi-definite Hermitian matrices of unit trace that 
describe the state of a quantum system. The goal of the paper is to develop minimax lower 
bounds on error rates of estimation of low rank density matrices in trace regression models 
used in quantum state tomography (in particular, in the case of Pauli measurements) 
with explicit dependence of the bounds on the rank and other complexity parameters. 

Such bounds are established for several statistically relevant distances, including quantum 
versions of Kullback-Leibler divergence (relative entropy distance) and of Hellinger distance 
(so called Bures distance), and Schatten p-norm distances. Sharp upper bounds and oracle 
inequalities for least squares estimator with von Neumann entropy penalization are obtained 
showing that minimax lower bounds are attained (up to logarithmic factors) for these 
distances. 

Keywords: quantum state tomography, low rank density matrix, minimax lower bounds 

1. Introduction 

This paper deals with optimality properties of estimators of density matrices, describing 
states of quantum systems, that are based on penalized empirical risk minimization with 
specially designed complexity penalties such as von Neumann entropy of the state. Alexey
Chervonenkis was a co-founder of the theory of empirical risk minimization that is of crucial
importance in machine learning, but he also had very broad interests that included,
in particular, quantum mechanics. By the choice of the topic, we would like to honor the 
memory of this great man and great scientist. 

Let $\mathbb{M}_m(\mathbb{C})$ be the set of all $m\times m$ matrices with complex entries and let
$\mathbb{H}_m = \mathbb{H}_m(\mathbb{C}) \subset \mathbb{M}_m(\mathbb{C})$ be the set of all Hermitian matrices:
$\mathbb{H}_m = \{A \in \mathbb{M}_m(\mathbb{C}) : A = A^*\}$, $A^*$ denoting the adjoint matrix of $A$.
For $A \in \mathbb{H}_m$, ${\rm tr}(A)$ denotes the trace of $A$ and $A \succeq 0$ means that $A$ is
positively semi-definite. Let $\mathcal{S}_m := \{S \in \mathbb{H}_m : S \succeq 0, {\rm tr}(S) = 1\}$ be the set of
all positively semi-definite Hermitian matrices of unit trace called density matrices. In
quantum mechanics, the state of a quantum system is usually characterized by a density
matrix $\rho \in \mathcal{S}_m$ (or, more generally, by a self-adjoint positively semi-definite operator of unit
trace acting in an infinite-dimensional Hilbert space, called a density operator). Often, very
large density matrices are needed to represent or to approximate the density operator of the
state. For instance, for a quantum system consisting of $b$ qubits, the density matrices are
of the size $m \times m$ with $m = 2^b$, so the dimension of the density matrix grows exponentially
with $b$. For instance, for a 10 qubit system, one has to deal with matrices that have $2^{20}$
entries. Thus, it becomes natural in the problems of statistical estimation of density matrix
$\rho$ to take advantage of the fact that it might be low rank, or nearly low rank (that is,
it could be well approximated by low rank matrices), which reduces the complexity of the
estimation problem.
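The definition of a density matrix and the growth of its size with the number of qubits can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `is_density_matrix` and `random_low_rank_state` are hypothetical and are introduced only for illustration.

```python
import numpy as np

def is_density_matrix(S, tol=1e-10):
    """Check the three defining properties: Hermitian, PSD, unit trace."""
    S = np.asarray(S, dtype=complex)
    if not np.allclose(S, S.conj().T, atol=tol):
        return False
    eigenvalues = np.linalg.eigvalsh(S)  # real eigenvalues of a Hermitian matrix
    return bool(eigenvalues.min() >= -tol and abs(np.trace(S).real - 1.0) <= tol)

def random_low_rank_state(m, r, rng):
    """Illustrative rank-r state: G G^* / tr(G G^*) with a random m x r factor G."""
    G = rng.standard_normal((m, r)) + 1j * rng.standard_normal((m, r))
    S = G @ G.conj().T
    return S / np.trace(S).real

rng = np.random.default_rng(0)
b = 3                 # number of qubits
m = 2 ** b            # matrix dimension m = 2^b
rho = random_low_rank_state(m, r=2, rng=rng)
rank_rho = np.linalg.matrix_rank(rho)
n_entries = m * m     # number of matrix entries, equal to 4^b
```

A rank-$r$ state of this form is determined by the $m \times r$ factor, which is far fewer parameters than the $m^2$ entries of a general matrix; this is the complexity reduction that low rank estimation exploits.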

In quantum state tomography (QST), the goal is to estimate an unknown state $\rho \in \mathcal{S}_m$
based on a number of specially designed measurements for the system prepared in state
$\rho$ (see Gross et al. 2010, Gross 2011, Koltchinskii 2011a, Cai et al. 2015 and references
therein). Given an observable $A \in \mathbb{H}_m$ with spectral representation $A = \sum_{j=1}^{m'} \lambda_j P_j$, where
$m' \le m$, $\lambda_j$ being the eigenvalues of $A$ and $P_j$ being the corresponding mutually orthogonal
eigenprojectors, the outcome of a measurement of $A$ for the system prepared in state $\rho$ is a
random variable $Y$ taking values $\lambda_j$ with probabilities ${\rm tr}(\rho P_j)$. The expectation of $Y$ is then
$\mathbb{E}_\rho Y = {\rm tr}(\rho A)$, so $Y$ could be viewed as a noisy observation of the value of the linear functional
${\rm tr}(\rho A)$ of the unknown density matrix $\rho$. A common approach is to choose an observable
$A$ at random, assuming that it is the value of a random variable $X$ with some design
distribution $\Pi$ in the space $\mathbb{H}_m$. More precisely, given a sample of $n$ i.i.d. copies $X_1,\dots,X_n$
of $X$, $n$ measurements are performed for the system identically prepared $n$ times in
state $\rho$, resulting in outcomes $Y_1,\dots,Y_n$. Based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$, the goal
is to estimate the target density matrix $\rho$. Clearly, the observations satisfy the following
model

$$Y_j = {\rm tr}(\rho X_j) + \xi_j, \quad j = 1,\dots,n, \tag{1}$$

where $\{\xi_j\}$ is a random noise consisting of $n$ i.i.d. random variables satisfying the condition
$\mathbb{E}_\rho(\xi_j | X_j) = 0$, $j = 1,\dots,n$. This is a special case of the so called trace regression model
intensively studied in the recent literature (see, e.g., Koltchinskii et al. 2011, Koltchinskii 2011b
and references therein).
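The measurement mechanism behind model (1) can be simulated directly. The sketch below assumes NumPy; the function `measure` is a hypothetical helper that samples the outcome $\lambda_j$ with probability ${\rm tr}(\rho P_j)$ by using one eigenvector at a time, which gives the same outcome distribution when eigenvalues repeat.

```python
import numpy as np

def measure(rho, A, n_shots, rng):
    """Sample outcomes of measuring observable A for a system in state rho."""
    eigenvalues, V = np.linalg.eigh(A)
    # probs[j] = <v_j, rho v_j>, the probability attached to eigenvector v_j
    probs = np.real(np.einsum("ij,ik,kj->j", V.conj(), rho, V))
    probs = np.clip(probs, 0.0, None)
    probs /= probs.sum()
    return rng.choice(eigenvalues, size=n_shots, p=probs)

rng = np.random.default_rng(0)
rho = np.array([[0.75, 0.0], [0.0, 0.25]])   # a single-qubit state
A = np.array([[1.0, 0.0], [0.0, -1.0]])      # observable with eigenvalues +1, -1
outcomes = measure(rho, A, n_shots=20000, rng=rng)
empirical_mean = outcomes.mean()
exact_mean = np.trace(rho @ A)               # tr(rho A) = 0.5
```

Each sampled outcome equals ${\rm tr}(\rho A)$ plus a mean-zero fluctuation, which is the role played by $\xi_j$ in (1).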

1.1 Assumptions 

A common choice of design distribution in this type of problems is so called uniform sampling 
from an orthonormal basis described in the following assumptions. 

Assumption 1 Let $\mathcal{E} = \{E_1,\dots,E_{m^2}\} \subset \mathbb{H}_m$ be an orthonormal basis of $\mathbb{H}_m$ with respect
to the Hilbert-Schmidt inner product: $\langle A, B\rangle = {\rm tr}(AB)$. Moreover, suppose that, for some
$U > 0$,

$$\|E_j\|_\infty \le U, \quad j = 1,\dots,m^2,$$

where $\|\cdot\|_\infty$ denotes the operator norm (the spectral norm).

Since $\|E_j\|_2 = 1$, where $\|\cdot\|_2$ denotes the Hilbert-Schmidt (or Frobenius) norm, we can
assume that $U \le 1$. Moreover, $U \ge m^{-1/2}$ since $1 = \|E_j\|_2 \le m^{1/2}\|E_j\|_\infty \le m^{1/2} U$.

Assumption 2 Let $\Pi$ be the uniform distribution in the finite set $\mathcal{E}$ (see Assumption 1),
let $X$ be a random variable sampled from $\Pi$ and let $X_1,\dots,X_n$ be i.i.d. copies of $X$.

It will be assumed in what follows that Assumptions 1 and 2 hold (unless it is stated
otherwise). Under these assumptions, $Y_1,\dots,Y_n$ could be viewed as noisy observations of
a random sample of Fourier coefficients $\langle\rho, X_1\rangle,\dots,\langle\rho, X_n\rangle$ of the target density matrix
$\rho$ in the basis $\mathcal{E}$. The above model (in which $X_1,\dots,X_n$ are uniformly sampled from an
orthonormal basis and $Y_1,\dots,Y_n$ are the outcomes of measurements of the observables
$X_1,\dots,X_n$ for the system being identically prepared $n$ times in the same state $\rho$) will be
called in what follows the standard QST model. It is a special case of the trace regression model
with bounded response:

Assumption 3 (Trace regression with bounded response) Suppose that Assumption
1 holds and let $(X,Y)$ be a random couple such that $X$ is sampled from the uniform
distribution $\Pi$ in an orthonormal basis $\mathcal{E} \subset \mathbb{H}_m$. Suppose also that, for some $\rho \in \mathcal{S}_m$,
$\mathbb{E}(Y|X) = \langle\rho, X\rangle$ a.s. and, for some $\bar U > 0$, $|Y| \le \bar U$ a.s. The data $(X_1,Y_1),\dots,(X_n,Y_n)$
consists of $n$ i.i.d. copies of $(X,Y)$.

We are also interested in the trace regression model with Gaussian noise: 


Assumption 4 (Trace regression with Gaussian noise) Suppose Assumption 1 holds
and let $(X,Y)$ be a random couple such that $X$ is sampled from the uniform distribution
$\Pi$ in an orthonormal basis $\mathcal{E} \subset \mathbb{H}_m$ and, for some $\rho \in \mathcal{S}_m$, $Y = \langle\rho, X\rangle + \xi$, where $\xi$ is a
normal random variable with mean $0$ and variance $\sigma_\xi^2$, $\xi$ and $X$ being independent. The
data $(X_1,Y_1),\dots,(X_n,Y_n)$ consists of $n$ i.i.d. copies of $(X,Y)$.


Note that this model is not directly applicable to the “standard QST problem” described 
above, where the response variable Y is discrete. However, if the measurements are repeated 
multiple times for each observable Xj and the resulting outcomes are averaged to reduce the 
variance, the noise of such averaged measurements becomes approximately Gaussian and it 
is of interest to characterize the estimation error in terms of the variance of the noise. 

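An important example of an orthonormal basis used in quantum state tomography is
the so called Pauli basis, see, e.g., Gross et al. (2010), Gross (2011). The Pauli basis in the
space $\mathbb{H}_2$ of $2 \times 2$ Hermitian matrices (observables in a single qubit system) consists of four
matrices $W_1, W_2, W_3, W_4$ defined as $W_i = \frac{1}{\sqrt{2}}\sigma_i$, $i = 1,\dots,4$, where

$$\sigma_1 := \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\sigma_2 := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
\sigma_3 := \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad
\sigma_4 := \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$

are the identity and the Pauli matrices. It is easy to check that $\{W_1, W_2, W_3, W_4\}$ indeed forms
an orthonormal basis in $\mathbb{H}_2$. The Pauli basis in the space $\mathbb{H}_m$ for $m = 2^b$ (the space of observables
for a $b$ qubits system) is defined by tensorisation, namely, it consists of $4^b$ tensor products
$W_{i_1} \otimes \dots \otimes W_{i_b}$, $(i_1,\dots,i_b) \in \{1,2,3,4\}^b$. Let us write these matrices as $E_1,\dots,E_{m^2}$ with
$E_1 = W_1 \otimes \dots \otimes W_1$. It is easy to see that each of them has eigenvalues $\pm\frac{1}{\sqrt{m}}$ and
$\|E_j\|_\infty = m^{-1/2}$, so, for this basis, $U = m^{-1/2}$. The fact that, for the Pauli basis, the operator
norms of basis matrices are as small as possible plays an important role in quantum state
tomography (Gross et al., 2010; Gross, 2011; Liu, 2011). Let
$E_j = \frac{1}{\sqrt{m}} Q_j^+ - \frac{1}{\sqrt{m}} Q_j^-$ be the spectral representation of $E_j$.

Then, an outcome of a measurement of $E_j$ in state $\rho$ is a random variable $\tau_j$ taking values
$\pm\frac{1}{\sqrt{m}}$ with probabilities $\langle\rho, Q_j^\pm\rangle$. Its expectation is $\mathbb{E}_\rho\tau_j = \langle\rho, E_j\rangle$. Of course, there exists a
unique representation of density matrix $\rho$ in the Pauli basis that can be written as follows:
$\rho = \sum_{j=1}^{m^2} \frac{\alpha_j}{\sqrt{m}} E_j$ with $\alpha_1 = 1$. Then, we clearly have $\mathbb{E}_\rho\tau_j = \frac{\alpha_j}{\sqrt{m}}$ and
$\mathbb{P}_\rho\{\tau_j = \pm\frac{1}{\sqrt{m}}\} = \frac{1 \pm \alpha_j}{2}$ (for $j = 1$, this gives $\mathbb{P}_\rho\{\tau_1 = \frac{1}{\sqrt{m}}\} = 1$). As a consequence,
${\rm Var}_\rho(\tau_j) = \frac{1 - \alpha_j^2}{m}$. Note that
$\sum_{j=1}^{m^2} \frac{\alpha_j^2}{m} = \|\rho\|_2^2 \le {\rm tr}^2(\rho) = 1$. This implies that there exists $j$ such that $\alpha_j^2 \le \frac{1}{2}$ and
${\rm Var}_\rho(\tau_j) \ge \frac{1}{2m}$. In fact, the number of such $j$ must be large, say, at least $\frac{m^2}{2}$ (provided that
$m > 4$). Thus, for "most" of the values of $j$, ${\rm Var}_\rho(\tau_j) \asymp \frac{1}{m}$. A way to reduce the variance is
to repeat the measurement of each observable $X_j$ $K$ times (for a system identically prepared
in state $\rho$) and to average the outcomes of such $K$ measurements. The resulting response
variable is $Y_j = \langle\rho, X_j\rangle + \xi_j$, where $\mathbb{E}_\rho(\xi_j | X_j) = 0$ and
$\mathbb{E}_\rho(\xi_j^2 | X_j) = {\rm Var}_\rho(Y_j | X_j) = \frac{1 - \alpha_{\nu_j}^2}{Km}$,
$\nu_j$ being defined by the relationship $X_j = E_{\nu_j}$.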
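The tensorisation construction and the variance formula can be verified numerically for a small number of qubits. This is a minimal sketch, assuming NumPy and the ordering of the matrices with the identity first (indices here start at 0 rather than 1); the function name `pauli_basis` is hypothetical.

```python
import itertools
import numpy as np

# Identity and the three Pauli matrices (identity first, so E_1 = I / sqrt(m)).
SIGMA = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_basis(b):
    """All 4^b tensor products of the normalized matrices W_i = sigma_i / sqrt(2)."""
    W = [s / np.sqrt(2) for s in SIGMA]
    basis = []
    for idx in itertools.product(range(4), repeat=b):
        E = np.array([[1.0 + 0j]])
        for i in idx:
            E = np.kron(E, W[i])
        basis.append(E)
    return basis

b = 2
m = 2 ** b
basis = pauli_basis(b)

# Gram matrix in the Hilbert-Schmidt inner product <A, B> = tr(AB).
gram = np.array([[np.trace(A @ B).real for B in basis] for A in basis])
op_norms = np.array([np.linalg.norm(E, 2) for E in basis])  # spectral norms

# Coefficients alpha_j and outcome variances (1 - alpha_j^2) / m for a pure state.
rho = np.zeros((m, m), dtype=complex)
rho[0, 0] = 1.0
alpha = np.array([np.sqrt(m) * np.trace(rho @ E).real for E in basis])
variances = (1.0 - alpha ** 2) / m
```

For this particular pure state only a few coefficients $\alpha_j$ are nonzero, so most basis elements give outcome variance exactly $1/m$, in line with the counting argument above.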

1.2 Preliminaries and Notations 

Some notations will be used throughout the paper. The Euclidean norm in $\mathbb{C}^m$ will be
denoted by $\|\cdot\|$ and the notation $\langle\cdot,\cdot\rangle$ will be used for both the Euclidean inner product in
$\mathbb{C}^m$ and for the Hilbert-Schmidt inner product in $\mathbb{H}_m$. $\|\cdot\|_p$, $p \ge 1$ will be used to denote the
Schatten $p$-norm in $\mathbb{H}_m$, namely $\|A\|_p^p = \sum_{j=1}^m |\lambda_j(A)|^p$, $A \in \mathbb{H}_m$, $\lambda_1(A) \ge \dots \ge \lambda_m(A)$ being
the eigenvalues of $A$. In particular, $\|\cdot\|_2$ denotes the Hilbert-Schmidt (or Frobenius) norm,
$\|\cdot\|_1$ denotes the nuclear (or trace) norm and $\|\cdot\|_\infty$ denotes the operator (or spectral) norm:
$\|A\|_\infty = \max_{1 \le j \le m} |\lambda_j(A)| = |\lambda_1(A)| \vee |\lambda_m(A)|$. The following well known interpolation inequality for
Schatten $p$-norms will be used to extend the bounds proved for some values of $p$ to the
whole range of its values. It easily follows from similar bounds for $\ell_p$-spaces.

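Lemma 1 (Interpolation inequality) For $1 \le p < q < r \le \infty$, let $\mu \in [0,1]$ be such
that

$$\frac{\mu}{p} + \frac{1-\mu}{r} = \frac{1}{q}.$$

Then, for all $A \in \mathbb{H}_m$,

$$\|A\|_q \le \|A\|_p^{\mu} \|A\|_r^{1-\mu}.$$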
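The interpolation inequality is easy to test on a random Hermitian matrix. The sketch below assumes NumPy; `schatten_norm` is a hypothetical helper that computes the norm from the eigenvalues, and the choice $p = 1$, $q = 2$, $r = \infty$ is only an example.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm of a Hermitian matrix, computed from its eigenvalues."""
    abs_eigs = np.abs(np.linalg.eigvalsh(A))
    if np.isinf(p):
        return abs_eigs.max()
    return (abs_eigs ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
A = (G + G.T) / 2                      # a random real symmetric matrix

p, q, r = 1.0, 2.0, np.inf
# Solve mu/p + (1 - mu)/r = 1/q for mu (1/inf evaluates to 0.0).
mu = (1 / q - 1 / r) / (1 / p - 1 / r)
lhs = schatten_norm(A, q)
rhs = schatten_norm(A, p) ** mu * schatten_norm(A, r) ** (1 - mu)
```

With these exponents $\mu = 1/2$, and the inequality reads $\|A\|_2 \le \sqrt{\|A\|_1 \|A\|_\infty}$.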



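Given $A \in \mathbb{H}_m$, define a function $f_A : \mathbb{H}_m \mapsto \mathbb{R}$ : $f_A(x) := \langle A, x\rangle$, $x \in \mathbb{H}_m$. For a given
random variable $X$ in $\mathbb{H}_m$ with a distribution $\Pi$, we have $\|f_A\|^2_{L_2(\Pi)} = \mathbb{E} f_A^2(X) = \mathbb{E}\langle A, X\rangle^2$.
Sometimes, with a minor abuse of notation, we might write
$\|A\|^2_{L_2(\Pi)} = \int_{\mathbb{H}_m} \langle A, x\rangle^2 \Pi(dx) = \|f_A\|^2_{L_2(\Pi)}$.
In what follows, $\Pi$ will be typically the uniform distribution in an orthonormal
basis $\mathcal{E} = \{E_1,\dots,E_{m^2}\} \subset \mathbb{H}_m$, implying that

$$\|f_A\|^2_{L_2(\Pi)} = \|A\|^2_{L_2(\Pi)} = m^{-2}\|A\|_2^2,$$

so, the $L_2(\Pi)$-norm is just a rescaled Hilbert-Schmidt norm.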

Consider $A \in \mathbb{H}_m$ with spectral representation $A = \sum_{j=1}^{m'} \lambda_j P_j$, $m' \le m$, with distinct
non-zero eigenvalues $\lambda_j$. Denote ${\rm sign}(A) := \sum_{j=1}^{m'} {\rm sign}(\lambda_j) P_j$ and denote by ${\rm supp}(A)$ the linear
span of the images of projectors $P_j$, $j = 1,\dots,m'$ (the subspace ${\rm supp}(A) \subset \mathbb{C}^m$ will be
called the support of $A$).

Given a subspace $L \subset \mathbb{C}^m$, $L^\perp$ denotes the orthogonal complement of $L$ and $P_L$ denotes
the orthogonal projection onto $L$. Let $\mathcal{P}_L, \mathcal{P}_L^\perp$ be orthogonal projection operators in the
space $\mathbb{H}_m$ (equipped with the Hilbert-Schmidt inner product), defined as follows:

$$\mathcal{P}_L^\perp(A) = P_{L^\perp} A P_{L^\perp}, \quad \mathcal{P}_L(A) = A - P_{L^\perp} A P_{L^\perp}.$$

These two operators split any Hermitian matrix $A$ into two orthogonal parts, $\mathcal{P}_L(A)$ and
$\mathcal{P}_L^\perp(A)$, the first one being of rank at most $2\dim(L)$.
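The two properties of this splitting (orthogonality of the parts and the rank bound $2\dim(L)$) can be confirmed on a random example. A minimal sketch, assuming NumPy; `split` is a hypothetical helper and the subspace $L$ is generated at random for illustration.

```python
import numpy as np

def split(A, basis_L):
    """Return (P_L(A), P_L^perp(A)) for L spanned by the orthonormal columns of basis_L."""
    m = A.shape[0]
    proj_L = basis_L @ basis_L.conj().T
    proj_perp = np.eye(m) - proj_L
    part_perp = proj_perp @ A @ proj_perp
    part_L = A - part_perp
    return part_L, part_perp

rng = np.random.default_rng(2)
m, k = 6, 2
G = rng.standard_normal((m, m))
A = (G + G.T) / 2
basis_L, _ = np.linalg.qr(rng.standard_normal((m, k)))  # orthonormal basis of L

part_L, part_perp = split(A, basis_L)
inner_product = np.trace(part_L @ part_perp)   # Hilbert-Schmidt inner product
rank_part_L = np.linalg.matrix_rank(part_L)
```

The rank bound holds because $\mathcal{P}_L(A) = P_L A + P_{L^\perp} A P_L$ is a sum of two matrices of rank at most $\dim(L)$ each.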

For a convex function $f : \mathbb{H}_m \mapsto \mathbb{R}$, $\partial f(A)$ denotes the subdifferential of $f$ at the point
$A \in \mathbb{H}_m$. It is well known that

$$\partial\|A\|_1 = \{{\rm sign}(A) + \mathcal{P}_L^\perp(M) : M \in \mathbb{H}_m, \|M\|_\infty \le 1\}, \tag{2}$$

where $L = {\rm supp}(A)$ (see Koltchinskii 2011b, p. 240 and references therein).

$C, C_1, C', c, c'$, etc. will denote constants (that do not depend on parameters of interest
such as $m$, $n$, etc.) whose values could change from line to line (or, even, within the same
line) without further notice. For nonnegative $A$ and $B$, $A \lesssim B$ (equivalently, $B \gtrsim A$) means
that $A \le CB$ for some absolute constant $C > 0$, and $A \asymp B$ means that $A \lesssim B$ and
$B \lesssim A$. Sometimes, symbols $\lesssim$, $\gtrsim$ and $\asymp$ could be provided with subscripts (say, $A \lesssim_\gamma B$)
to indicate that constant $C$ may depend on a parameter (say, $\gamma$).

In what follows, $P$ denotes the distribution of $(X,Y)$ and $P_n$ denotes the corresponding
empirical distribution based on the sample $(X_1,Y_1),\dots,(X_n,Y_n)$ of $n$ i.i.d. observations.
Similarly, $\Pi$ is the distribution of $X$ (typically, uniform in an orthonormal basis) and $\Pi_n$
is the corresponding empirical distribution based on the sample $(X_1,\dots,X_n)$. We will use
standard notations $Pf = \mathbb{E} f(X,Y)$, $P_n f = n^{-1}\sum_{j=1}^n f(X_j,Y_j)$, $\Pi g = \mathbb{E} g(X)$, $\Pi_n g = n^{-1}\sum_{j=1}^n g(X_j)$.


1.3 Estimation Methods 

Recall that the central problem in quantum state tomography is to estimate a large density
matrix $\rho$ based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$ satisfying the trace regression model.
Often, the goal is to develop adaptive estimators with optimal dependence of the estimation
error (measured by various statistically relevant distances) on the unknown rank of the
target matrix $\rho$ under the assumption that $\rho$ is low rank, or on other complexity parameters
in the case when the target matrix $\rho$ can be well approximated by low rank matrices.

The simplest estimation procedure for density matrix $\rho$ is the least squares estimator
defined by the following convex optimization problem:

$$\hat\rho := \mathop{\rm argmin}_{S \in \mathcal{S}_m} \frac{1}{n}\sum_{j=1}^n (Y_j - \langle S, X_j\rangle)^2. \tag{3}$$

Since, for all $S \in \mathcal{S}_m$, $\|S\|_1 = {\rm tr}(S) = 1$, we have that

$$\hat\rho = \hat\rho^{\varepsilon} := \mathop{\rm argmin}_{S \in \mathcal{S}_m}\left[\frac{1}{n}\sum_{j=1}^n (Y_j - \langle S, X_j\rangle)^2 + \varepsilon\|S\|_1\right], \quad \varepsilon \ge 0. \tag{4}$$

Thus, in the case of density matrices, the least squares estimator $\hat\rho$ coincides with the
matrix LASSO estimator with nuclear norm penalty and arbitrary value of regularization
parameter $\varepsilon$. The nuclear norm penalty is used as a proxy of the rank that provides a
convex relaxation for the rank penalized least squares method. Matrix LASSO is a standard
method of low rank estimation in trace regression models that has been intensively studied
in recent years, see, for instance, Candes and Plan (2011), Rohde and Tsybakov
(2011), Koltchinskii (2011b), Koltchinskii et al. (2011), Negahban and Wainwright (2010)
and references therein. In the case of estimation of density matrices, due to their positive
semidefiniteness and trace constraint, the nuclear norm penalization is present implicitly
even in the case of a non-penalized least squares estimator $\hat\rho$ (see also Koltchinskii 2013a,
Kalev et al. 2015 where similar ideas were used).

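Note that the estimator $\hat\rho$ can be also rewritten as

$$\hat\rho := \mathop{\rm argmin}_{S \in \mathcal{S}_m}\left[\|S\|^2_{L_2(\Pi_n)} - \frac{2}{n}\sum_{j=1}^n Y_j\langle S, X_j\rangle\right]. \tag{5}$$

Replacing the empirical $L_2(\Pi_n)$-norm with the "true" $L_2(\Pi)$-norm (which could make
sense in the case when the design distribution $\Pi$ is known) yields the following modified
least squares estimator studied in Koltchinskii et al. (2011), Koltchinskii (2013a):

$$\check\rho := \mathop{\rm argmin}_{S \in \mathcal{S}_m}\left[\|S\|^2_{L_2(\Pi)} - \frac{2}{n}\sum_{j=1}^n Y_j\langle S, X_j\rangle\right]. \tag{6}$$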
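Under uniform sampling from an orthonormal basis, $\|S\|^2_{L_2(\Pi)} = m^{-2}\|S\|_2^2$, so the objective in (6) equals $m^{-2}\|S - Z\|_2^2$ up to a constant, with $Z = \frac{m^2}{n}\sum_j Y_j X_j$. The modified least squares estimator is therefore the Hilbert-Schmidt projection of $Z$ onto the set of density matrices. The following is a minimal sketch of that computation, assuming NumPy and the standard fact that such a projection acts on the eigenvalues by projecting them onto the probability simplex; the function names are hypothetical, and this is an illustration rather than the procedure used in the cited papers.

```python
import numpy as np

def project_to_simplex(z):
    """Euclidean projection of a real vector onto the probability simplex."""
    u = np.sort(z)[::-1]                       # sort in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(z) + 1)
    last = np.nonzero(u - css / k > 0)[0][-1]  # largest index with positive gap
    theta = css[last] / (last + 1)
    return np.maximum(z - theta, 0.0)

def modified_lse(X, Y, m):
    """X: array (n, m, m) of sampled basis matrices; Y: array (n,) of responses."""
    n = len(Y)
    Z = (m ** 2 / n) * np.einsum("j,jkl->kl", Y, X)
    Z = (Z + Z.conj().T) / 2                   # symmetrize against rounding errors
    eigenvalues, V = np.linalg.eigh(Z)
    weights = project_to_simplex(eigenvalues)
    return (V * weights) @ V.conj().T

# Noiseless sanity check: every basis element of H_2 is observed exactly once.
paulis = np.array(
    [[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[0, -1j], [1j, 0]], [[1, 0], [0, -1]]],
    dtype=complex,
) / np.sqrt(2)
rho = np.array([[0.75, 0.25], [0.25, 0.25]], dtype=complex)
Y = np.array([np.trace(rho @ E).real for E in paulis])
estimate = modified_lse(paulis, Y, m=2)
```

In the noiseless check $Z$ reproduces the Fourier expansion of $\rho$ exactly, so the projection leaves it unchanged; with noisy data the simplex projection is what enforces positive semi-definiteness and unit trace.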

Another estimator was proposed in Koltchinskii (2011a) and it is based on an idea of
using the so called von Neumann entropy as a penalizer in the least squares method. Von Neumann
entropy is a canonical extension of Shannon's entropy to the quantum setting. For a density
matrix $S \in \mathcal{S}_m$, it is defined as $\mathcal{E}(S) := -{\rm tr}(S\log S)$. The estimator proposed in Koltchinskii
(2011a) is defined as follows:

$$\tilde\rho^{\varepsilon} := \mathop{\rm argmin}_{S \in \mathcal{S}_m}\left[\frac{1}{n}\sum_{j=1}^n (Y_j - \langle S, X_j\rangle)^2 + \varepsilon\,{\rm tr}(S\log S)\right]. \tag{7}$$

Essentially, it is based on a trade-off between fitting the model via the least squares method
in the class of all density matrices and maximizing the entropy of the quantum state. Note
that (7) is also a convex optimization problem (due to concavity of von Neumann entropy,
see Nielsen and Chuang 2000) and its solution $\tilde\rho^{\varepsilon}$ is a full rank matrix (see Koltchinskii
2011a, the proof of Proposition 3). It should be also mentioned that the idea of estimation
of a density matrix of a quantum state by maximizing the von Neumann entropy subject
to constraints based on the data has been used in quantum state tomography earlier (see
Buzek 2004 and references therein).
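The entropy penalty in (7) is simple to evaluate from the eigenvalues of $S$. The sketch below assumes NumPy; `von_neumann_entropy` and `penalized_objective` are hypothetical helpers that only evaluate the criterion and do not solve the optimization problem.

```python
import numpy as np

def von_neumann_entropy(S, eps=1e-15):
    """-tr(S log S), using the convention 0 log 0 = 0 for zero eigenvalues."""
    lam = np.linalg.eigvalsh(S)
    lam = lam[lam > eps]
    return float(-(lam * np.log(lam)).sum())

def penalized_objective(S, X, Y, penalty):
    """Least squares fit plus penalty * tr(S log S), as in the criterion (7)."""
    fitted = np.real(np.einsum("jkl,lk->j", X, S))   # <S, X_j> = tr(S X_j)
    return float(np.mean((Y - fitted) ** 2) - penalty * von_neumann_entropy(S))

m = 4
maximally_mixed = np.eye(m) / m
pure_state = np.zeros((m, m))
pure_state[0, 0] = 1.0

entropy_mixed = von_neumann_entropy(maximally_mixed)   # equals log(m)
entropy_pure = von_neumann_entropy(pure_state)         # equals 0
```

Since ${\rm tr}(S \log S) = -\mathcal{E}(S)$, a larger penalty parameter pulls the minimizer towards the maximally mixed state, whose entropy $\log m$ is the largest possible.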


1.4 Distances between Density Matrices 

The main purpose of this paper is to study the optimality properties of estimator $\tilde\rho^{\varepsilon}$ with
respect to a variety of statistically meaningful distances, in the case when the underlying
density matrix $\rho$ is low rank. These distances include Schatten $p$-norm distances for
$p \in [1,2]$,$^{1}$ but also quantum versions of Hellinger distance and Kullback-Leibler divergence
that are of importance in quantum statistics and quantum information. A version of the
(squared) Hellinger distance that will be studied is defined as

$$H^2(S_1, S_2) := 2 - 2\,{\rm tr}\sqrt{S_1^{1/2} S_2 S_1^{1/2}}$$

for $S_1, S_2 \in \mathcal{S}_m$ (see also Nielsen and Chuang 2000). Clearly, $0 \le H^2(S_1, S_2) \le 2$. In
quantum information literature, it is usually called Bures distance and it does not coincide
with ${\rm tr}(\sqrt{S_1} - \sqrt{S_2})^2$ (which is another possible non-commutative extension of the classical
Hellinger distance). In fact, $H^2(S_1, S_2) \le {\rm tr}(\sqrt{S_1} - \sqrt{S_2})^2$, $S_1, S_2 \in \mathcal{S}_m$, but the opposite
inequality does not necessarily hold. The quantity ${\rm tr}\sqrt{S_1^{1/2} S_2 S_1^{1/2}}$ in the right hand side of
the definition of $H^2$ is a quantum version of Hellinger affinity.

The noncommutative Kullback-Leibler divergence (or relative entropy distance) $K(\cdot\|\cdot)$
is defined as (see also Nielsen and Chuang 2000):

$$K(S_1\|S_2) := \langle S_1, \log S_1 - \log S_2\rangle.$$

If $\log S_2$ is not well-defined (for instance, some of the eigenvalues of $S_2$ are equal to $0$) we
set $K(S_1\|S_2) = +\infty$. The symmetrized version of Kullback-Leibler divergence is defined as

$$K(S_1; S_2) := K(S_1\|S_2) + K(S_2\|S_1) = \langle S_1 - S_2, \log S_1 - \log S_2\rangle.$$

The following very useful inequality is a noncommutative extension of similar classical
inequalities for total variation, Hellinger and Kullback-Leibler distances. It follows from
representing the "noncommutative distances" involved in the inequality as suprema of the
corresponding classical distances between the distributions of outcomes of measurements
for two states $S_1, S_2$ over all possible measurements represented by positive operator valued
measures (see Nielsen and Chuang 2000, Klauck et al. 2007, Koltchinskii 2011a, Section 3
and references therein).

Lemma 2 For all $S_1, S_2 \in \mathcal{S}_m$, the following inequalities hold:

$$\frac{1}{4}\|S_1 - S_2\|_1^2 \le H^2(S_1, S_2) \le \left(K(S_1\|S_2) \wedge \|S_1 - S_2\|_1\right). \tag{8}$$
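The chain of inequalities (8) can be checked numerically on random full rank states. A minimal sketch, assuming NumPy; all function names are hypothetical, and matrix functions are computed through the eigendecomposition.

```python
import numpy as np

def apply_spectral(S, f):
    """Apply a scalar function f to the eigenvalues of a Hermitian matrix S."""
    lam, V = np.linalg.eigh(S)
    return (V * f(lam)) @ V.conj().T

def trace_norm(A):
    return float(np.abs(np.linalg.eigvalsh(A)).sum())

def hellinger_sq(S1, S2):
    """Squared Bures distance 2 - 2 tr sqrt(S1^{1/2} S2 S1^{1/2})."""
    root = apply_spectral(S1, lambda x: np.sqrt(np.clip(x, 0.0, None)))
    inner = root @ S2 @ root
    inner = (inner + inner.conj().T) / 2
    affinity = np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0.0, None)).sum()
    return float(2.0 - 2.0 * affinity)

def kl_divergence(S1, S2):
    """K(S1 || S2) = <S1, log S1 - log S2> for full rank states."""
    diff = apply_spectral(S1, np.log) - apply_spectral(S2, np.log)
    return float(np.trace(S1 @ diff).real)

def random_state(m, rng):
    G = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    S = G @ G.conj().T
    return S / np.trace(S).real

rng = np.random.default_rng(3)
S1, S2 = random_state(4, rng), random_state(4, rng)
lower = 0.25 * trace_norm(S1 - S2) ** 2
middle = hellinger_sq(S1, S2)
upper = min(kl_divergence(S1, S2), trace_norm(S1 - S2))
```

The random states here are full rank with probability one, so both logarithms are well defined; for a rank deficient $S_2$ the divergence would be set to $+\infty$ as described above.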

1.5 Matrix Bernstein Inequalities 

Non-commutative (matrix) versions of Bernstein inequality will be used in what follows. 
The most common version is stated (in a convenient form for our applications) in the 
following lemma. 

Lemma 3 Let $X, X_1,\dots,X_n \in \mathbb{H}_m$ be i.i.d. random matrices with $\mathbb{E}X = 0$, $\sigma_X^2 :=
\|\mathbb{E}X^2\|_\infty$ and $\|X\|_\infty \le U$ a.s. for some $U > 0$. Then, for all $t \ge 0$, with probability at
least $1 - e^{-t}$,

$$\left\|\frac{1}{n}\sum_{j=1}^n X_j\right\|_\infty \le 2\left[\sigma_X\sqrt{\frac{t + \log(2m)}{n}} \bigvee U\,\frac{t + \log(2m)}{n}\right].$$

1. Similar problems for estimators $\hat\rho$, $\check\rho$ and for Schatten $p$-norm distances with $p \in (2,+\infty]$ are studied in
a related paper by Xia and Koltchinskii

Koltchinskii and Xia 


The proof of such bounds could be found, e.g., in Tropp (2012). Other versions of
matrix Bernstein type inequalities for not necessarily bounded random matrices will be also
used in what follows and they could be found in Koltchinskii (2011b), Koltchinskii (2013a).
A simple consequence of the inequality of Lemma 3 is the following expectation bound:

$$\mathbb{E}\left\|\frac{1}{n}\sum_{j=1}^n X_j\right\|_\infty \lesssim \sigma_X\sqrt{\frac{\log(2m)}{n}} \bigvee U\,\frac{\log(2m)}{n}.$$

It follows from the exponential bound by integrating the tail probabilities.

The paper is organized as follows. In Section 2, minimax lower bounds on estimation
error of low rank density matrices are provided in Schatten $p$-norm, Hellinger (Bures) and
Kullback-Leibler distances. In Section 3.1, sharp low rank oracle inequalities for von Neumann
entropy penalized least squares estimator are derived in the case of trace regression
model with bounded response. In Section 3.2, low rank oracle inequalities are established in
the case of trace regression with Gaussian noise. In addition to this, in these two sections,
upper bounds on estimation error with respect to Kullback-Leibler distance are obtained.
In Section 3.3, they are further developed and extended to other distances (Hellinger distance,
Schatten $p$-norm distances for $p \in [1,2]$) showing the minimax optimality (up to
logarithmic factors) of the error rates of the least squares estimator with von Neumann
entropy penalization.


2. Minimax Lower Bounds 

In this section, we provide main results on the minimax lower bounds on the risk of estimation
of density matrices with respect to Schatten $p$-norm (or, rather, $q$-norm in the notations
used below) distances as well as Hellinger-Bures distance and Kullback-Leibler divergence.

Minimax lower bounds will be derived for the class $\mathcal{S}_{r,m} := \{S \in \mathcal{S}_m : {\rm rank}(S) \le r\}$
consisting of all density matrices of rank at most $r$ (the low rank case). We will start with the
case of trace regression with Gaussian noise. Given that the sample $(X_1,Y_1),\dots,(X_n,Y_n)$
satisfies Assumption 4 with the target density matrix $\rho \in \mathcal{S}_m$ and noise variance $\sigma_\xi^2$, let $\mathbb{P}_\rho$
denote the corresponding probability distribution.

Note that Ma and Wu (2013) developed a method of deriving minimax lower bounds for
distances based on unitary invariant norms, including Schatten $p$-norms in matrix problems,
and obtained such lower bounds, in particular, in the matrix completion problem. The approach
used in our paper is somewhat different and the aim is to develop such bounds under an
additional constraint that the target matrix is a density matrix. The resulting bounds
are also somewhat different: they involve an additional term that does not depend on the
rank, but does depend on $q$. Essentially, it means that the "complexity" of the problem is
controlled by a "truncated rank" $r \wedge \bar r$, where $\bar r = \frac{\sqrt{n}}{\sigma_\xi m^{3/2}}$, rather than by the actual rank
$r$. The upper bounds of Section 3.3 show that such a structure of the bound is, indeed,
necessary. It should be also mentioned that minimax lower bounds on the nuclear norm
error of estimation of density matrices have been obtained earlier in Flammia et al. (2012)
(see Remark 11 below).

Theorem 4 For all $q \in [1,+\infty]$, there exist constants $c, c' > 0$ such that the following
bounds hold:

$$\inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r,m}} \mathbb{P}_\rho\left\{\|\hat\rho - \rho\|_q \ge c\left(\frac{\sigma_\xi m^{3/2} r^{1/q}}{\sqrt{n}} \bigwedge \left(\frac{\sigma_\xi m^{3/2}}{\sqrt{n}}\right)^{1-\frac{1}{q}} \bigwedge 1\right)\right\} \ge c', \tag{9}$$

$$\inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r,m}} \mathbb{P}_\rho\left\{H^2(\hat\rho, \rho) \ge c\left(\frac{\sigma_\xi m^{3/2} r}{\sqrt{n}} \bigwedge 1\right)\right\} \ge c', \tag{10}$$

and

$$\inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r,m}} \mathbb{P}_\rho\left\{K(\rho\|\hat\rho) \ge c\left(\frac{\sigma_\xi m^{3/2} r}{\sqrt{n}} \bigwedge 1\right)\right\} \ge c', \tag{11}$$

where $\inf_{\hat\rho}$ denotes the infimum over all estimators $\hat\rho$ in $\mathcal{S}_m$ based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$
satisfying the Gaussian trace regression model with noise variance $\sigma_\xi^2$.


Proof A couple of preliminary facts will be needed in the proof. We start with bounds
on the packing numbers of the Grassmann manifold $G_{k,l}$, which is the set of all $k$-dimensional
subspaces $L$ of the $l$-dimensional space $\mathbb{R}^l$. Given such a subspace $L \subset \mathbb{R}^l$ with $\dim(L) = k$,
let $P_L$ be the orthogonal projection onto $L$ and let $\mathfrak{P}_{k,l} := \{P_L : L \in G_{k,l}\}$. The set
of all $k$-dimensional projectors $\mathfrak{P}_{k,l}$ will be equipped with Schatten $q$-norm distances for
all $q \in [1,+\infty]$ (which also could be viewed as distances on the Grassmannian itself):
$d_q(Q_1, Q_2) := \|Q_1 - Q_2\|_q$, $Q_1, Q_2 \in \mathfrak{P}_{k,l}$. Recall that the $\varepsilon$-packing number of a metric
space $(T, d)$ is defined as

$$D(T, d, \varepsilon) = \max\left\{n : \text{there are } t_1,\dots,t_n \in T \text{ such that } \min_{i \ne j} d(t_i, t_j) \ge \varepsilon\right\}.$$

The following lemma (see Pajor 1998, Proposition 8) will be used to control the packing
numbers of $\mathfrak{P}_{k,l}$ with respect to Schatten distances $d_q$.


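Lemma 5 For all integer $1 \le k \le l$ such that $k \le l - k$, and all $1 \le q \le \infty$, the following
bounds hold:

$$\left(\frac{c}{\varepsilon}\right)^d \le D(\mathfrak{P}_{k,l}, d_q, \varepsilon k^{1/q}) \le \left(\frac{C}{\varepsilon}\right)^d, \quad \varepsilon > 0, \tag{12}$$

with $d = k(l - k)$ and universal positive constants $c, C$.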


In addition to this, we need the following well known information-theoretic bound frequently
used in derivation of minimax lower bounds (see Tsybakov 2008, Theorem 2.5).
Let $\Theta = \{\theta_0, \theta_1,\dots,\theta_M\}$ be a finite parameter space equipped with a metric $d$ and let
$\mathcal{P} := \{\mathbb{P}_\theta : \theta \in \Theta\}$ be a family of probability distributions in some sample space. Given
$P, Q \in \mathcal{P}$, let $K(P\|Q) := \mathbb{E}_P\log\frac{dP}{dQ}$ be the Kullback-Leibler divergence between $P$ and $Q$.

Proposition 6 Suppose that the following conditions hold:

(i) for some $s > 0$, $d(\theta_j, \theta_k) \ge 2s > 0$, $0 \le j < k \le M$;

(ii) for some $0 < \alpha < 1/8$, $\frac{1}{M}\sum_{j=1}^M K(\mathbb{P}_{\theta_j}\|\mathbb{P}_{\theta_0}) \le \alpha\log M$.

Then, for a positive constant $c_\alpha$,

$$\inf_{\hat\theta}\sup_{\theta \in \Theta} \mathbb{P}_\theta\{d(\hat\theta, \theta) \ge s\} \ge c_\alpha,$$

where the infimum is taken over all estimators $\hat\theta \in \Theta$ based on an observation sampled from
$\mathbb{P}_\theta$.


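We now turn to the actual proof of Theorem 4. Under Assumption 4, the following
computation is well known: for $\rho_1, \rho_2 \in \mathcal{S}_{r,m}$,

$$K(\mathbb{P}_{\rho_1}\|\mathbb{P}_{\rho_2}) = \mathbb{E}_{\rho_1}\log\frac{d\mathbb{P}_{\rho_1}}{d\mathbb{P}_{\rho_2}}(X_1, Y_1,\dots,X_n, Y_n)
= \mathbb{E}_{\rho_1}\sum_{j=1}^n\left[-\frac{(Y_j - \langle\rho_1, X_j\rangle)^2}{2\sigma_\xi^2} + \frac{(Y_j - \langle\rho_2, X_j\rangle)^2}{2\sigma_\xi^2}\right]
= \frac{n}{2\sigma_\xi^2}\|\rho_1 - \rho_2\|^2_{L_2(\Pi)} = \frac{n}{2\sigma_\xi^2 m^2}\|\rho_1 - \rho_2\|_2^2. \tag{13}$$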




It is enough to prove the bounds for $2 \le r \le m/2$. The proof in the case $r = 1$ is simpler
and the case $r > m/2$ easily reduces to the case $r \le m/2$. We will use Lemma 5 to construct
a well separated (with respect to $d_q$) subset of density matrices in $\mathcal{S}_{r,m}$. To this end, first
choose a subset $\mathcal{D}_q \subset \mathfrak{P}_{r-1,m-1}$ such that ${\rm card}(\mathcal{D}_q) \ge 2^{(r-1)(m-r)}$ and, for some constant
$c'$, $\|Q_1 - Q_2\|_q \ge c'(r-1)^{1/q}$, $Q_1, Q_2 \in \mathcal{D}_q$, $Q_1 \ne Q_2$. Such a choice is possible due
to the lower bound on the packing numbers of Lemma 5. For $Q \in \mathcal{D}_q$ (note that $Q$ can
be viewed as an $(m-1) \times (m-1)$ matrix with real entries) and $\kappa \in (0,1)$, consider the
following $m \times m$ matrix

$$S = S_Q = \begin{pmatrix} 1 - \kappa & 0' \\ 0 & \kappa\frac{Q}{r-1} \end{pmatrix}. \tag{14}$$

Note that $S$ is a symmetric positively-semidefinite real matrix of unit trace. It is straightforward
to check that it defines a Hermitian positively-semidefinite operator in $\mathbb{C}^m$ of unit
trace, and it can be identified with a density matrix $S \in \mathcal{S}_m$. Clearly, $S$ is of rank $r$, so
$S \in \mathcal{S}_{r,m}$.
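The block construction (14) is easy to reproduce for small dimensions. The sketch below assumes NumPy; `build_state` is a hypothetical helper, the projectors $Q$ are drawn at random rather than taken from a packing set, and the value of $\kappa$ is arbitrary.

```python
import numpy as np

def build_state(Q, kappa):
    """Embed a rank-(r-1) projector Q of size (m-1) x (m-1) into the state S_Q of (14)."""
    r_minus_1 = int(round(np.trace(Q)))
    m = Q.shape[0] + 1
    S = np.zeros((m, m))
    S[0, 0] = 1.0 - kappa
    S[1:, 1:] = kappa * Q / r_minus_1
    return S

def random_projector(dim, k, rng):
    """Orthogonal projector onto a random k-dimensional subspace of R^dim."""
    B, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return B @ B.T

def trace_norm(A):
    return float(np.abs(np.linalg.eigvalsh(A)).sum())

rng = np.random.default_rng(4)
m, r, kappa = 6, 3, 0.3
Q1 = random_projector(m - 1, r - 1, rng)
Q2 = random_projector(m - 1, r - 1, rng)
S1, S2 = build_state(Q1, kappa), build_state(Q2, kappa)

rank_S1 = np.linalg.matrix_rank(S1)
# Distances between the states are rescaled distances between the projectors.
dist_states = trace_norm(S1 - S2)
dist_projectors = kappa / (r - 1) * trace_norm(Q1 - Q2)
```

The identity between the two distances holds for every Schatten $q$-norm, which is how separation of the projectors in $d_q$ transfers to separation of the density matrices.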

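We will take $\kappa := c_1\frac{\sigma_\xi m^{3/2}(r-1)}{\sqrt{n}}$ with a small enough absolute constant $c_1 > 0$ and first
assume that $\kappa < 1$ (as it is needed in definition (14)).

Let $\mathcal{S}_q := \{S_Q : Q \in \mathcal{D}_q\}$ and consider a family of $M + 1 = {\rm card}(\mathcal{D}_q) \ge 2^{(r-1)(m-r)}$
distributions $\{\mathbb{P}_S : S \in \mathcal{S}_q\}$. It is immediate that for $S_1 = S_{Q_1}$, $S_2 = S_{Q_2}$, $Q_1, Q_2 \in \mathcal{D}_q$,
$Q_1 \ne Q_2$, we have

$$\|S_1 - S_2\|_q = \frac{\kappa}{r-1}\|Q_1 - Q_2\|_q \ge c'\kappa(r-1)^{1/q-1}
= c'c_1\frac{\sigma_\xi m^{3/2}(r-1)^{1/q}}{\sqrt{n}} \ge c\,\frac{\sigma_\xi m^{3/2} r^{1/q}}{\sqrt{n}} \tag{15}$$

with some constant $c > 0$, implying condition (i) of Proposition 6 with
$s := \frac{c}{2}\frac{\sigma_\xi m^{3/2} r^{1/q}}{\sqrt{n}}$.

We will now check its condition (ii). In view of (13), we have, for all $S_1 = S_{Q_1}$, $S_2 = S_{Q_2} \in \mathcal{S}_q$,

$$K(\mathbb{P}_{S_1}\|\mathbb{P}_{S_2}) = \frac{n}{2\sigma_\xi^2 m^2}\|S_1 - S_2\|_2^2
= \frac{n\kappa^2}{2\sigma_\xi^2 m^2 (r-1)^2}\|Q_1 - Q_2\|_2^2
\le \frac{2n\kappa^2}{\sigma_\xi^2 m^2 (r-1)} = 2c_1^2 m(r-1) \tag{16}$$

$$\le \alpha m(r-1)\log(2)/4 \le \frac{\alpha}{2}(r-1)(m-r)\log(2) \le \alpha\log M,$$

provided that constant $c_1$ is small enough, so, condition (ii) of Proposition 6 is also satisfied.
Proposition 6 implies that, under the assumption $\kappa = c_1\frac{\sigma_\xi m^{3/2}(r-1)}{\sqrt{n}} < 1$, the following
minimax lower bound holds for some $c, c' > 0$:

$$\inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r,m}} \mathbb{P}_\rho\left\{\|\hat\rho - \rho\|_q \ge c\,\frac{\sigma_\xi m^{3/2} r^{1/q}}{\sqrt{n}}\right\} \ge c'. \tag{17}$$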

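In the case when

$$c_1\frac{\sigma_\xi m^{3/2}}{\sqrt{n}} < 1 \le c_1\frac{\sigma_\xi m^{3/2}(r-1)}{\sqrt{n}},$$

one can choose $2 \le r' \le r - 1$ such that, for some constant $c_2 > 0$,

$$c_2 \le c_1\frac{\sigma_\xi m^{3/2}(r'-1)}{\sqrt{n}} < 1.$$

For such a choice of $r'$, it follows from (17) that

$$\inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r',m}} \mathbb{P}_\rho\left\{\|\hat\rho - \rho\|_q \ge c\,\frac{\sigma_\xi m^{3/2}(r')^{1/q}}{\sqrt{n}}\right\} \ge c'. \tag{18}$$

The definition of $r'$ implies that

$$r' \asymp \left(\frac{\sigma_\xi m^{3/2}}{\sqrt{n}}\right)^{-1}.$$

Therefore,

$$\frac{\sigma_\xi m^{3/2}(r')^{1/q}}{\sqrt{n}} \asymp \left(\frac{\sigma_\xi m^{3/2}}{\sqrt{n}}\right)^{1-1/q}$$

and, since $\mathcal{S}_{r',m} \subset \mathcal{S}_{r,m}$, bound (18) yields

$$\inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r,m}} \mathbb{P}_\rho\left\{\|\hat\rho - \rho\|_q \ge c\left(\frac{\sigma_\xi m^{3/2}}{\sqrt{n}}\right)^{1-1/q}\right\}
\ge \inf_{\hat\rho}\sup_{\rho \in \mathcal{S}_{r',m}} \mathbb{P}_\rho\left\{\|\hat\rho - \rho\|_q \ge c\left(\frac{\sigma_\xi m^{3/2}}{\sqrt{n}}\right)^{1-1/q}\right\} \ge c' \tag{19}$$

for some constants $c, c' > 0$. This allows us to recover the second term in the minimum in
bound (9). Finally, in the case when $c_1\frac{\sigma_\xi m^{3/2}}{\sqrt{n}} \ge 1$, the minimax lower bound becomes a
constant (and the proof is based on a simplified version of the above argument that could
be done for $r = 1$). This completes the proof of bound (9) for Schatten $q$-norms.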

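The proof of bound (10) for the Hellinger distance is similar. In the case $r \ge 2$, we will
use a "well separated" set of density matrices $\mathcal{S}_q \subset \mathcal{S}_{r,m}$ for $q = 1$ constructed above. We
still use $\kappa := c_1\frac{\sigma_\xi m^{3/2}(r-1)}{\sqrt{n}}$, assuming first that $\kappa \in (0,1)$. For $S_{Q_1}, S_{Q_2} \in \mathcal{S}_1$ with $Q_1 \ne Q_2$,
it follows by a simple computation and using bound (8) that, for some $c'' > 0$,

$$H^2(S_{Q_1}, S_{Q_2}) = \kappa H^2\left(\frac{Q_1}{r-1}, \frac{Q_2}{r-1}\right)
\ge \frac{1}{4}\frac{\kappa}{(r-1)^2}\|Q_1 - Q_2\|_1^2 \ge \frac{(c')^2}{4}\kappa \ge c''\frac{\sigma_\xi m^{3/2}(r-1)}{\sqrt{n}}.$$

Repeating the argument based on Proposition 6 yields bound (10) in the case when
$\kappa = c_1\frac{\sigma_\xi m^{3/2}(r-1)}{\sqrt{n}} < 1$, and in the opposite case it is easy to see that the lower bound is a
constant.

Finally, bound (11) for the Kullback-Leibler divergence follows from (10) and the
inequality $K(\rho\|\hat\rho) \ge H^2(\hat\rho, \rho)$ (see inequality (8)).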


Next we state similar results in the case of trace regression model with bounded response
(see Assumption 3). Denote by $\mathcal{P}_{r,m}(\bar U)$ the class of all distributions $P$ of $(X,Y)$ such that
Assumption 3 holds for some $\bar U$ and $\mathbb{E}(Y|X) = \langle\rho_P, X\rangle$ for some $\rho_P \in \mathcal{S}_{r,m}$. Given $P$, $\mathbb{P}_P$
denotes the corresponding probability measure (such that $(X_1,Y_1),\dots,(X_n,Y_n)$ are i.i.d.
copies of $(X,Y)$ sampled from $P$).


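Theorem 7 Suppose $\bar U \ge 2U$. For all $q \in [1,+\infty]$, there exist absolute constants $c, c' > 0$
such that the following bounds hold:

$$\inf_{\hat\rho}\sup_{P \in \mathcal{P}_{r,m}(\bar U)} \mathbb{P}_P\left\{\|\hat\rho - \rho_P\|_q \ge c\left(\frac{\bar U m^{3/2} r^{1/q}}{\sqrt{n}} \bigwedge \left(\frac{\bar U m^{3/2}}{\sqrt{n}}\right)^{1-\frac{1}{q}} \bigwedge 1\right)\right\} \ge c', \tag{20}$$

$$\inf_{\hat\rho}\sup_{P \in \mathcal{P}_{r,m}(\bar U)} \mathbb{P}_P\left\{H^2(\hat\rho, \rho_P) \ge c\left(\frac{\bar U m^{3/2} r}{\sqrt{n}} \bigwedge 1\right)\right\} \ge c', \tag{21}$$

and

$$\inf_{\hat\rho}\sup_{P \in \mathcal{P}_{r,m}(\bar U)} \mathbb{P}_P\left\{K(\rho_P\|\hat\rho) \ge c\left(\frac{\bar U m^{3/2} r}{\sqrt{n}} \bigwedge 1\right)\right\} \ge c', \tag{22}$$

where $\inf_{\hat\rho}$ denotes the infimum over all estimators $\hat\rho$ in $\mathcal{S}_m$ based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$.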


Proof The proof relies on an idea already used in a context of matrix completion by
Koltchinskii et al. (2011) (see their Theorem 7). We need the same family $\mathcal{S}_q \subset \mathcal{S}_{r,m}$ of "well
separated" density matrices of rank $r$ as in the proof of Theorem 4. For a density matrix $\rho$,
let $(X,Y)$ be a random couple such that $X$ is sampled from the uniform distribution $\Pi$ in
$\mathcal{E}$ and, conditionally on $X$, $Y$ takes value $+\bar U$ with probability $p_\rho(X) := \frac{1}{2} + \frac{\langle\rho, X\rangle}{2\bar U}$ and value
$-\bar U$ with probability $q_\rho(X) := \frac{1}{2} - \frac{\langle\rho, X\rangle}{2\bar U}$. Since $\bar U \ge 2U$ and $|\langle\rho, X\rangle| \le \|\rho\|_1\|X\|_\infty \le U$, we
have $p_\rho(X), q_\rho(X) \in [1/4, 3/4]$ (so, they are bounded away from $0$ and from $1$). Clearly,
$\mathbb{E}_\rho(Y|X) = \langle\rho, X\rangle$. Let $P_\rho$ denote the distribution of such a couple and $\mathbb{P}_\rho$ denote the
corresponding distribution of the data $(X_1,Y_1),\dots,(X_n,Y_n)$. Then, for all $\rho \in \mathcal{S}_{r,m}$, $P_\rho \in
\mathcal{P}_{r,m}(\bar U)$. The only difference with the proof of Theorem 4 is in the bound on the Kullback-Leibler
divergence $K(\mathbb{P}_{\rho_1}\|\mathbb{P}_{\rho_2})$ (see Equation (13)). It is easy to see that

$$K(\mathbb{P}_{\rho_1}\|\mathbb{P}_{\rho_2}) = n\,\mathbb{E}\left(p_{\rho_1}(X)\log\frac{p_{\rho_1}(X)}{p_{\rho_2}(X)} + q_{\rho_1}(X)\log\frac{q_{\rho_1}(X)}{q_{\rho_2}(X)}\right). \tag{23}$$

The following simple inequality will be used: for all $a, b \in [1/4, 3/4]$,

$$a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b} \le 12(a-b)^2.$$
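The elementary inequality can be confirmed on a grid. Since the Kullback-Leibler divergence between Bernoulli distributions is bounded by the chi-square divergence $(a-b)^2/(b(1-b))$ and $b(1-b) \ge 3/16$ on $[1/4, 3/4]$, the constant $12$ is not tight. The following sketch uses only the Python standard library; the grid resolution is an arbitrary choice.

```python
import math

def bernoulli_kl(a, b):
    """Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

# Grid of 201 equally spaced points covering [1/4, 3/4].
grid = [0.25 + 0.5 * i / 200 for i in range(201)]

# Largest observed value of KL(a, b) / (a - b)^2 over distinct grid points.
worst_ratio = max(
    bernoulli_kl(a, b) / (a - b) ** 2
    for a in grid
    for b in grid
    if a != b
)
```

The observed ratio stays below $16/3$, and hence below $12$, which is all the proof requires.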

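It implies that

$$K(\mathbb{P}_{\rho_1}\|\mathbb{P}_{\rho_2}) \le \frac{3n\,\mathbb{E}\langle\rho_1 - \rho_2, X\rangle^2}{\bar U^2} = \frac{3n}{\bar U^2 m^2}\|\rho_1 - \rho_2\|_2^2.$$

This bound is used instead of identity (13) from the proof of Theorem 4. The rest of the
proof is the same. ■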


Note that the proof requires the possible range $[-\bar U, \bar U]$ of response variable $Y$ to be
larger than the possible range $[-U, U]$ of Fourier coefficients $\langle\rho, E_j\rangle$, $j = 1,\dots,m^2$. This is
not the case for the standard QST model described in the introduction (see also the example
of Pauli measurements) and it is of interest to prove a version of minimax lower bounds
without this constraint, including the case when $\bar U = U$. The following theorem is a result
in this direction.


Theorem 8 Suppose Assumption 1 is satisfied and, moreover, for some constant $\gamma \in (0,1)$,

$$|\mathrm{tr}(E_k)| \le (1-\gamma)Um, \quad k = 1, \dots, m^2. \tag{24}$$

Then, for all $q \in [1, +\infty]$, there exist constants $c_\gamma, c'_\gamma > 0$ such that the following bounds hold:

iiif sup ^php- pp\\q > ^ —^/ 1/m \ (25) 

P P&Pr,m.(U) I \ Vn \ Vn j /J 


sup Pp|i7^(/9,pp) > ^ /\\\ \ > d , 

Pr^u) I V Vn ' ' 7 J 


and 


inf 

P PePr,m{U) 


inf sup Pp|i7(pp||p) > ^ /\l] \ > c' , 

\Pr,m(U) I \ Vn J i 


(26) 


(27) 


P PePr,miU) 

where $\inf_{\hat\rho}$ denotes the infimum over all estimators $\hat\rho$ in $\mathcal{S}_m$ based on the data $(X_1, Y_1), \dots, (X_n, Y_n)$.

Proof The proof is based on the following lemma.


Lemma 9 Suppose assumption (24) holds. Let $K$ be a sufficiently large absolute constant (to be chosen later) and let $m$ satisfy the condition $\frac{K\log m}{\sqrt{m}} \le \frac{\gamma}{2}$ (which means that $m \ge A_\gamma$ for some constant $A_\gamma$). Then there exists $v \in \mathbb{C}^m$ with $\|v\| = 1$ such that

$$|\langle E_k v, v\rangle| \le (1-\gamma/2)U, \quad k = 1, \dots, m^2. \tag{28}$$



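Proof We will prove this fact by a probabilistic argument. Namely, set $v := m^{-1/2}(\varepsilon_1, \dots, \varepsilon_m)$, where $\varepsilon_j = \pm 1$. We will show that there is a choice of “signs” $\varepsilon_j$ such that (28) holds. Assume that $\varepsilon_j$, $j = 1, \dots, m$ are i.i.d. and take values $\pm 1$ with probability 1/2 each. Let $E_k := (a_{ij}^{(k)})_{i,j=1,\dots,m}$. For simplicity, assume that $E_k$ is a symmetric real matrix (in the complex case, the proof can be easily modified). We have

$$\langle E_k v, v\rangle = \frac{1}{m}\sum_{i=1}^m a_{ii}^{(k)} + \frac{1}{m}\sum_{i\ne j} a_{ij}^{(k)}\varepsilon_i\varepsilon_j = \frac{\mathrm{tr}(E_k)}{m} + \frac{1}{m}\sum_{i\ne j} a_{ij}^{(k)}\varepsilon_i\varepsilon_j.$$

It is well known that

$$\mathbb{E}\Bigl|\sum_{i\ne j} a_{ij}^{(k)}\varepsilon_i\varepsilon_j\Bigr|^2 \le 2\|E_k\|_2^2 = 2.$$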


Moreover, it follows from exponential inequalities for Rademacher chaos (see, e.g., Corollary 3.2.6 in de la Peña and Giné 1999) that for some absolute constant A > 0 and for all t > 0, with probability at least $1 - e^{-t}$


{EkV, v) 


tr(£’fc) 

m 


1 

m 




Kt 

m 


Taking t = 2 log m and using the union bound, we conclude that with probability at least 

1 = 1 _ A > 0 , 

m ’ 


max 


{EkV, v) 


tijEk) ^ Alogm ^ K log m ^ 7 ^ 
m ~ m ~ ffim ~ 2 ’ 


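where we also used the fact that $U \ge m^{-1/2}$. Thus, there exists a choice of signs $\varepsilon_j$ such that

$$\max_{1\le k\le m^2}|\langle E_k v, v\rangle| \le \max_{1\le k\le m^2}\frac{|\mathrm{tr}(E_k)|}{m} + \frac{\gamma}{2}U,$$

which, under condition (24), implies (28).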
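
The decomposition of $\langle E_k v, v\rangle$ into the term $\mathrm{tr}(E_k)/m$ plus a Rademacher chaos, used in the proof of Lemma 9, can be illustrated with a small simulation. This is a hedged sketch: the helper names are hypothetical, and the orthonormal family of real symmetric unit matrices is chosen only for illustration (it is not the basis of the paper).

```python
import numpy as np


def symmetric_unit_basis(m):
    """Orthonormal (Hilbert-Schmidt) family of real symmetric m x m matrices."""
    basis = []
    for i in range(m):
        for j in range(i, m):
            E = np.zeros((m, m))
            if i == j:
                E[i, i] = 1.0
            else:
                E[i, j] = E[j, i] = 1.0 / np.sqrt(2.0)
            basis.append(E)
    return basis


def sign_vector_quadratic_forms(basis, rng):
    """Draw v = m^{-1/2}(eps_1, ..., eps_m) with random signs and return
    the arrays (<E_k v, v>)_k and (tr(E_k) / m)_k."""
    m = basis[0].shape[0]
    eps = rng.choice([-1.0, 1.0], size=m)
    v = eps / np.sqrt(m)
    forms = np.array([np.real(np.vdot(v, E @ v)) for E in basis])
    centers = np.array([np.real(np.trace(E)) / m for E in basis])
    return forms, centers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    forms, centers = sign_vector_quadratic_forms(symmetric_unit_basis(6), rng)
    # Deviation of <E_k v, v> from tr(E_k)/m is the chaos term divided by m.
    print(np.max(np.abs(forms - centers)))
```

For this particular family the deviation is zero for diagonal units and has modulus $\sqrt{2}/m$ for off-diagonal ones, whatever the signs are, so the example only shows the structure of the decomposition and not the concentration phenomenon itself.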




We set $e_1 := v$ (where $v$ is the unit vector introduced in Lemma 9) and construct an orthonormal basis $e_1, \dots, e_m$. Assume that matrices $S_Q$ defined by (14) represent linear transformations in basis $e_1, \dots, e_m$. Then we have

$$\langle S_Q, E_k\rangle = (1-\kappa)\langle E_k v, v\rangle + \frac{\kappa}{r-1}\langle Q, E_k\rangle.$$

Therefore,

$$|\langle S_Q, E_k\rangle| \le (1-\kappa)|\langle E_k v, v\rangle| + \frac{\kappa}{r-1}\|E_k\|_\infty\|Q\|_1 \le (1-\kappa)(1-\gamma/2)U + \kappa U = (1-(1-\kappa)(\gamma/2))U.$$

Assuming that $\kappa \le 1/2$, we get

$$|\langle S_Q, E_k\rangle| \le (1-\gamma/4)U, \quad k = 1, \dots, m^2. \tag{29}$$

The rest of the proof becomes similar to the proof of Theorem 7 (with $\bar U = U$). Namely, bound (29) implies that, for $\rho = S_Q$ and $X$ being sampled from the orthonormal basis $\{E_1, \dots, E_{m^2}\}$, probabilities $p_\rho(X)$ and $q_\rho(X)$ are bounded away from 0 and from 1: $p_\rho(X), q_\rho(X) \in [\gamma/8, 1-\gamma/8]$. This allows us to complete the argument of the proof of Theorem 7. ■

Theorem 8 does not apply directly to the Pauli basis since condition (24) fails in this case. Indeed, by the definition of Pauli basis, $U = m^{-1/2}$ and $\mathrm{tr}(E_1) = \sqrt{m} = Um > (1-\gamma)Um$.

Note also that $\mathrm{tr}(E_j) = 0$, $j = 2, \dots, m^2$. Thus, for Pauli basis, $E_1$ is the only matrix for which condition (24) fails. However, for this matrix $\langle \rho, E_1\rangle = m^{-1/2}\mathrm{tr}(\rho) = m^{-1/2} = U$ for all density matrices $\rho \in \mathcal{S}_m$. This immediately implies that $p_\rho(E_1) = 1$ and $q_\rho(E_1) = 0$ for all $\rho \in \mathcal{S}_m$ and, as a result, the value $X = E_1$ does not have an impact on the computation of Kullback-Leibler divergence in (23). For the rest of the matrices in the Pauli basis, condition (24) holds implying also bound (28). Therefore, if $X \ne E_1$, we still have that, for $\rho = S_Q$, $p_\rho(X), q_\rho(X) \in [\gamma/8, 1-\gamma/8]$, and the proof of Theorem 7 can be completed in this case, too. Note also that, given $X$ sampled from the Pauli basis, the binary random variable $Y$ taking values $\pm U = \pm m^{-1/2}$ with probabilities $p_\rho(X)$ and $q_\rho(X)$, respectively (this is exactly the random variable used in the construction of the proof of Theorem 7) coincides with an outcome of a Pauli measurement for the system prepared in state $\rho$. These considerations yield the following minimax lower bounds for Pauli measurements.
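
The properties of the Pauli basis used in this discussion ($U = m^{-1/2}$, $\mathrm{tr}(E_1) = \sqrt{m}$, $\mathrm{tr}(E_j) = 0$ for $j \ge 2$, and $\langle \rho, E_1\rangle = m^{-1/2}$) can be verified directly for a small number of qubits. The sketch below assumes the usual normalization $E_j = m^{-1/2} W_j$ of tensor products $W_j$ of the $2 \times 2$ Pauli matrices; the function names are hypothetical.

```python
import itertools

import numpy as np

PAULI = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def pauli_basis(k):
    """Normalized k-qubit Pauli basis E_1, ..., E_{m^2} with m = 2^k.
    The first element is the normalized identity I_m / sqrt(m)."""
    m = 2 ** k
    basis = []
    for idx in itertools.product(range(4), repeat=k):
        W = np.array([[1.0 + 0j]])
        for i in idx:
            W = np.kron(W, PAULI[i])
        basis.append(W / np.sqrt(m))
    return basis


def fourier_coefficient(rho, E):
    """Hilbert-Schmidt inner product <rho, E> = tr(rho E) for Hermitian inputs."""
    return float(np.real(np.trace(rho @ E)))


if __name__ == "__main__":
    basis = pauli_basis(2)
    m = 4
    print(np.real(np.trace(basis[0])), np.sqrt(m))
    print(max(abs(np.trace(E)) for E in basis[1:]))
    print(fourier_coefficient(np.eye(m) / m, basis[0]), m ** -0.5)
```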

Theorem 10 Let $\{E_1, \dots, E_{m^2}\}$ be the Pauli basis in the space $\mathbb{H}_m$ of $m \times m$ Hermitian matrices and let $X_1, \dots, X_n$ be i.i.d. random variables sampled from the uniform distribution in $\{E_1, \dots, E_{m^2}\}$. Let $Y_1, \dots, Y_n$ be outcomes of measurements of observables $X_1, \dots, X_n$ for the system being identically prepared $n$ times in state $\rho$. The corresponding distribution of the data $(X_1, Y_1), \dots, (X_n, Y_n)$ will be denoted by $\mathbb{P}_\rho$. Then, for all $q \in [1, +\infty]$, there exist constants $c, c' > 0$ such that the following bounds hold:

inf sup Ppj H^{p, p) > /\ iH > c', (31) 

p P&Sr,m I \Vn ' ' / J 

and 

inf sup Pp|a:(/j||p) > c("^/\lH > c', (32) 

P P&Sr,m I \Vn' / J 

where $\inf_{\hat\rho}$ denotes the infimum over all estimators $\hat\rho$ in $\mathcal{S}_m$ based on the data $(X_1, Y_1), \dots, (X_n, Y_n)$.

Remark 11 Minimax lower bounds on nuclear norm error of density matrix estimation close to bound (30) for $q = 1$ (but for a somewhat different “estimation protocol” and stated in a different form) were obtained earlier in Flammia et al. (2012). This paper also contains upper bounds on the errors of matrix LASSO and Dantzig selector estimators in the nuclear norm matching the lower bounds up to log-factors.

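Remark 12 It is easy to see that, if constant $\gamma \in (0,1)$ is small enough (namely, $\gamma < 1 - \frac{1}{\sqrt{2}}$), then, in an arbitrary orthonormal basis $\{E_1, \dots, E_{m^2}\}$, there is at most one matrix $E_j$ such that $|\mathrm{tr}(E_j)| > (1-\gamma)Um$. Indeed, note that $\mathrm{tr}(E_j) = \langle E_j, I_m\rangle$. Since

$$\sum_{j=1}^{m^2}\langle E_j, I_m\rangle^2 = \|I_m\|_2^2 = m$$

and $U^2 m \ge 1$, we have

$$\mathrm{card}\bigl(\{j : |\langle E_j, I_m\rangle| > (1-\gamma)Um\}\bigr) \le \frac{1}{(1-\gamma)^2U^2m^2}\sum_{j=1}^{m^2}\langle E_j, I_m\rangle^2 = \frac{m}{(1-\gamma)^2U^2m^2} = \frac{1}{(1-\gamma)^2U^2m} \le \frac{1}{(1-\gamma)^2} < 2,$$

provided that $\gamma < 1 - \frac{1}{\sqrt{2}}$.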
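
The counting argument of Remark 12 can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it builds a random orthonormal basis of the real space of Hermitian matrices by rotating a standard basis with a random orthogonal matrix (a construction chosen here for convenience, not taken from the paper), and the function names are hypothetical.

```python
import numpy as np


def standard_hermitian_basis(m):
    """Standard orthonormal basis of the m^2-dimensional real space of
    m x m Hermitian matrices, returned as an array of shape (m^2, m, m)."""
    basis = []
    for i in range(m):
        D = np.zeros((m, m), dtype=complex)
        D[i, i] = 1.0
        basis.append(D)
        for j in range(i + 1, m):
            A = np.zeros((m, m), dtype=complex)
            A[i, j] = A[j, i] = 1.0 / np.sqrt(2.0)
            B = np.zeros((m, m), dtype=complex)
            B[i, j] = -1j / np.sqrt(2.0)
            B[j, i] = 1j / np.sqrt(2.0)
            basis.extend([A, B])
    return np.array(basis)


def random_orthonormal_basis(m, rng):
    """Rotate the standard basis by a random real orthogonal matrix."""
    std = standard_hermitian_basis(m)
    O, _ = np.linalg.qr(rng.normal(size=(m * m, m * m)))
    return np.einsum("kl,lij->kij", O, std)


def large_trace_count(basis, gamma):
    """Return (number of j with |tr(E_j)| > (1 - gamma) U m, sum_j tr(E_j)^2),
    where U is the largest operator norm in the basis."""
    m = basis.shape[1]
    U = max(np.linalg.norm(E, 2) for E in basis)
    traces = np.real(np.trace(basis, axis1=1, axis2=2))
    count = int(np.sum(np.abs(traces) > (1.0 - gamma) * U * m))
    return count, float(np.sum(traces ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(large_trace_count(random_orthonormal_basis(4, rng), gamma=0.25))
```

The second returned value equals $\|I_m\|_2^2 = m$ by Parseval's identity, which is the only property of the basis the remark relies on.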

Remark 13 It will be shown in Section 3.3 that the minimax rates of Theorems 4, 7, 8 and 10 are attained up to logarithmic factors for the von Neumann entropy penalized least squares estimator.


Remark 14 Similar minimax lower bounds could be proved in certain classes of “nearly low rank” density matrices. Consider, for instance, the following class

$$\mathcal{B}_p(d;m) := \Bigl\{S \in \mathcal{S}_m : \sum_{j=1}^m |\lambda_j(S)|^p \le d\Bigr\} \tag{33}$$

for some $d > 0$ and $p \in [0,1]$, where $\lambda_1(S) \ge \dots \ge \lambda_m(S)$ denote the eigenvalues of $S$. This set consists of density matrices with the eigenvalues decaying at a certain rate (nearly low rank case) and, for $p = 0$, $d = r$ it coincides with $\mathcal{S}_{r,m}$. It turns out that minimax lower bounds of Theorems 4 and 7 hold for the class $\mathcal{B}_p(d;m)$ (instead of $\mathcal{S}_{r,m}$) with $r$ replaced by

$$\bar r := \bar r(\tau, d, m, p) = d\tau^{-p} \wedge m,$$

where $\tau := \frac{\sigma_\xi m^{3/2}}{\sqrt{n}}$ in the case of trace regression with Gaussian noise and $\tau := \frac{\bar U m^{3/2}}{\sqrt{n}}$ in the case of trace regression with bounded response. These minimax bounds are attained up to logarithmic factors for a slightly modified von Neumann entropy penalized least squares estimator.

Note that, for $\rho \in \mathcal{B}_p(d,m)$ with eigenvalues $\lambda_1(\rho) \ge \dots \ge \lambda_m(\rho)$, we have $\lambda_j(\rho) \le (d/j)^{1/p}$, $j = 1, \dots, m$. Therefore, for $j \ge \bar r$, $\lambda_j(\rho) \le \tau$. Note also that $\tau$ characterizes the minimax rate of estimation of $\rho \in \mathcal{S}_{r,m}$ in the operator norm for any value of the rank $r$ (see bound (9) for $q = +\infty$; the corresponding upper bound also holds for the least squares estimator up to a logarithmic factor, see Xia and Koltchinskii). Roughly speaking, $\tau$ is a threshold below which the estimation of eigenvalues $\lambda_j(\rho)$ becomes impossible and $\bar r$ can be viewed as an “effective rank” of nearly low rank density matrices in the class $\mathcal{B}_p(d,m)$.
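
The role of the “effective rank” can be illustrated with a short computation. This is a minimal sketch assuming the formula $\bar r = d\tau^{-p} \wedge m$ and the eigenvalue envelope $\lambda_j \le (d/j)^{1/p}$ discussed above; the function names and the example spectrum are hypothetical.

```python
def effective_rank(tau: float, d: float, m: int, p: float) -> float:
    """Effective rank r_bar = min(d * tau^{-p}, m) of the class B_p(d; m)."""
    return min(d * tau ** (-p), m)


def eigenvalue_envelope(d: float, p: float, j: int) -> float:
    """Upper bound (d / j)^{1/p} on the j-th largest eigenvalue, for p in (0, 1]."""
    return (d / j) ** (1.0 / p)


if __name__ == "__main__":
    m, p, tau = 50, 0.5, 0.01
    # Hypothetical spectrum with polynomial decay, normalized to unit trace.
    raw = [j ** -2.0 for j in range(1, m + 1)]
    lam = [x / sum(raw) for x in raw]
    d = sum(x ** p for x in lam)
    r_bar = effective_rank(tau, d, m, p)
    below = [j for j in range(1, m + 1) if j >= r_bar]
    # Every eigenvalue with index at least r_bar falls below the threshold tau.
    print(r_bar, all(lam[j - 1] <= tau for j in below))
```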

3. Von Neumann Entropy Penalization: Optimality and Oracle 
Inequalities 

The goal of this section is to study optimality properties of von Neumann entropy penalized least squares estimator $\hat\rho^\varepsilon$ defined by (7). In particular, we establish oracle inequalities for such estimators in the cases of trace regression with bounded response (Subsection 3.1) and trace regression with Gaussian noise (Subsection 3.2), and prove upper bounds on their estimation errors measured by Schatten $q$-norm distances for $q \in [1,2]$ and also by Hellinger and Kullback-Leibler distances (Subsection 3.3).

3.1 Oracle Inequalities for Trace Regression with Bounded Response 

In this subsection, we prove a sharp low rank oracle inequality for estimator $\hat\rho^\varepsilon$ defined by (7). It is done in the case of trace regression model with bounded response (that is, under Assumption 3). The results of this type show some form of optimality of the estimation method, namely, that the estimator provides an optimal trade-off between the “approximation error” of the target density matrix by a low rank “oracle” and the “estimation error” of the “oracle” that is proportional to its rank. Sharp oracle inequalities (in which
the leading constant in front of the “approximation error” is equal to 1, so that the bound 
mimics precisely the approximation by the oracle) are usually harder to prove. In the case 
of low rank matrix completion, the first result of this type was proved by Koltchinskii et al. 
(2011) for a modified least squares estimator with nuclear norm penalty. A version of such 
inequality for empirical risk minimization with nuclear norm penalty (that includes matrix 
LASSO) was first proved by Koltchinskii (2013b). Low rank oracle inequalities for von 
Neumann entropy penalized least squares method with the leading constant larger than 1 
were proved by Koltchinskii (2011a). The main result of this section refines these previous 
bounds by proving a sharp oracle inequality, improving the logarithmic factors and removing superfluous assumptions, but also by establishing the inequality in the whole range of values of regularization parameter $\varepsilon \ge 0$ (including the value $\varepsilon = 0$, for which $\hat\rho^\varepsilon$ coincides with the least squares estimator $\hat\rho$). In addition to this, for a special choice of regularization parameter $\varepsilon$, the theorem below also provides an upper bound on the Kullback-Leibler error $K(\rho\|\hat\rho^\varepsilon)$ of $\hat\rho^\varepsilon$ that matches the minimax lower bound (22) up to log-factors (and “second order terms”). It turns out that, for this choice of $\varepsilon$, the estimator satisfies exactly the same low rank oracle inequality as the best inequalities known for LASSO estimator and minimax optimal error rates are attained for $\hat\rho^\varepsilon$ also with respect to Hellinger distance and Schatten $q$-norm distances for all $q \in [1,2]$ (see Section 3.3). For simplicity, it will be assumed that constants $U$ in Assumption 1 and $\bar U$ in Assumption 3 coincide (in the upper bounds, one can always replace $U$ and $\bar U$ by $U \vee \bar U$).

Theorem 15 Suppose Assumption 3 holds with constant $\bar U = U$ and let $\varepsilon \in [0,1]$. Then, there exists a constant $C > 0$ such that for all $t \ge 1$ with probability at least $1 - e^{-t}$

$$\|f_{\hat\rho^\varepsilon} - f_\rho\|_{L_2(\Pi)}^2 \le \inf_{S \in \mathcal{S}_m}\Bigl[\|f_S - f_\rho\|_{L_2(\Pi)}^2 + C\Bigl(\mathrm{rank}(S)m^2\varepsilon^2\log^2(mn) + U^2\frac{\mathrm{rank}(S)m\log(2m)}{n} + U^2\frac{t + \log\log_2(2n)}{n}\Bigr)\Bigr]. \tag{34}$$

In particular, this implies that

$$\|f_{\hat\rho^\varepsilon} - f_\rho\|_{L_2(\Pi)}^2 \le C\Bigl[\mathrm{rank}(\rho)m^2\varepsilon^2\log^2(mn) + U^2\frac{\mathrm{rank}(\rho)m\log(2m)}{n} + U^2\frac{t + \log\log_2(2n)}{n}\Bigr]. \tag{35}$$

Moreover, if

$$\varepsilon := \frac{1}{\log(mn)}\Bigl[U\sqrt{\frac{\log(2m)}{nm}} \vee U^2\frac{\log(2m)}{n}\Bigr],$$

then, with some constant $C$ and with probability at least $1 - e^{-t}$,

$$\|f_{\hat\rho^\varepsilon} - f_\rho\|_{L_2(\Pi)}^2 \le C\Bigl[U^2\frac{\mathrm{rank}(\rho)m\log(2m)}{n}\Bigl(1 \vee U^2\frac{m\log(2m)}{n}\Bigr) + U^2\frac{t + \log\log_2(2n)}{n}\Bigr] \tag{36}$$

and

$$K(\rho\|\hat\rho^\varepsilon) \le CU\Bigl[\frac{\mathrm{rank}(\rho)m^{3/2}\sqrt{\log(2m)}\log(mn)}{\sqrt{n}}\Bigl(1 \vee U\sqrt{\frac{m\log(2m)}{n}}\Bigr) + \sqrt{\frac{m}{n}}\,\frac{(t + \log\log_2(2n))\log(mn)}{\sqrt{\log(2m)}}\Bigr]. \tag{37}$$



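Proof The following notations will be used in the proof. Let $\ell(y,u) := (u-y)^2$, $y, u \in \mathbb{R}$ be the quadratic loss function. For $f : \mathbb{H}_m \mapsto \mathbb{R}$ denote

$$(\ell \bullet f)(x,y) = (f(x)-y)^2, \quad (\ell' \bullet f)(x,y) = 2(f(x)-y)$$

and

$$P(\ell \bullet f) = \mathbb{E}(Y - f(X))^2, \quad P_n(\ell \bullet f) = n^{-1}\sum_{j=1}^n (Y_j - f(X_j))^2.$$

For $A \in \mathbb{H}_m$, let $f_A(x) = \langle A, x\rangle$, $x \in \mathbb{H}_m$. Since for density matrices $S \in \mathcal{S}_m$, $\|S\|_1 = \mathrm{tr}(S) = 1$, the estimator $\hat\rho = \hat\rho^\varepsilon$ can be equivalently defined by the following convex optimization problem:

$$\hat\rho = \mathrm{argmin}_{S \in \mathcal{S}_m} L_n(S), \quad L_n(S) := P_n(\ell \bullet f_S) + \varepsilon\,\mathrm{tr}(S\log S) + \bar\varepsilon\|S\|_1$$

for an arbitrary $\bar\varepsilon > 0$.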
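
To make the objective concrete, the penalized empirical risk can be evaluated as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the nuclear norm term is omitted because it is constant (equal to its weight) on the set of density matrices.

```python
import numpy as np


def entropy_term(S):
    """tr(S log S) for a density matrix S, with the convention 0 log 0 = 0."""
    lam = np.linalg.eigvalsh(S)
    lam = lam[lam > 1e-15]
    return float(np.sum(lam * np.log(lam)))


def penalized_risk(S, X, Y, eps):
    """Empirical squared loss of f_S(x) = <S, x> = tr(S x) plus eps * tr(S log S).
    X is a sequence of Hermitian matrices and Y the matching responses."""
    preds = np.array([np.real(np.trace(S @ x)) for x in X])
    return float(np.mean((np.asarray(Y) - preds) ** 2)) + eps * entropy_term(S)


if __name__ == "__main__":
    m = 4
    S = np.eye(m) / m          # maximally mixed state
    X = [np.eye(m)]            # a single (hypothetical) observable
    Y = [1.0]                  # its noiseless response tr(S X) = 1
    print(penalized_risk(S, X, Y, eps=0.1), -0.1 * np.log(m))
```

Minimizing this function over density matrices would require a constrained convex solver, which is outside the scope of this sketch.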

The following lemma will be crucial in the proofs of Theorem 15 as well as Theorem 19 in the following subsection. Note that it does not rely on Assumption 3, only Assumptions 1 and 2 are needed.

Lemma 16 Suppose Assumptions 1 and 2 hold. Let $\delta \in (0,1)$ and $S := (1-\delta)S' + \delta\frac{I_m}{m}$, where $S' \in \mathcal{S}_m$, $\mathrm{rank}(S') = r$ and $I_m$ is the $m \times m$ identity matrix. Then the following bound holds:

$$\|f_{\hat\rho} - f_\rho\|_{L_2(\Pi)}^2 + \frac{1}{2}\|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + \varepsilon K(\hat\rho; S) + \bar\varepsilon\|\mathcal{P}_L^\perp(\hat\rho)\|_1 \le \|f_S - f_\rho\|_{L_2(\Pi)}^2 + rm^2\varepsilon^2\log^2(m/\delta) + rm^2\bar\varepsilon^2 + 4\bar\varepsilon\delta + (P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S). \tag{38}$$

Lemma 16 will be often used together with the following simple bound:

ll/s — /plliain) ~ i^ll*^ “ ^lli — 

^\\S'- p\\l + ^\\s' - phWS' -Sh + ^\\S'- 5||i (39) 

< Wfs' - /plli 2 (n) + ^ + - ll/s'' “ /plli 2 (n) + 

Together, they imply that 

Wfp - fpWhiu) + hWfp - fsWl^iu) + S) + £\\vj;ip)\\^ 

< \\fS' - /plli 2 (n) + log^{m/S) + rni^e^ (40) 

+4e5 +^^ + {P- PnW • /p)(/p - fs). 

We will now give the proof of Lemma 16, 

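Proof By standard necessary conditions of extremum in convex problems, we get that, for all $S \in \mathcal{S}_m$ and for some $\hat V \in \partial\|\hat\rho\|_1$,

$$P_n(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S) + \varepsilon\langle \log\hat\rho, \hat\rho - S\rangle + \bar\varepsilon\langle \hat V, \hat\rho - S\rangle \le 0$$

(see, e.g., Aubin and Ekeland 2006, Chapter 2, Corollary 6; see also Koltchinskii 2011b, pp. 198-199; for the computation of derivative of the function $\mathrm{tr}(S\log S)$, see Lemma 1 in Koltchinskii 2011a). Replacing in the left hand side $P_n$ by $P$, we get

$$P(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S) + \varepsilon\langle \log\hat\rho, \hat\rho - S\rangle + \bar\varepsilon\langle \hat V, \hat\rho - S\rangle \le (P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S).$$

It is easy to check that for the quadratic loss

$$P(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S) = P(\ell \bullet f_{\hat\rho}) - P(\ell \bullet f_S) + \|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2,$$

implying that

$$P(\ell \bullet f_{\hat\rho}) - P(\ell \bullet f_S) + \|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + \varepsilon\langle \log\hat\rho, \hat\rho - S\rangle + \bar\varepsilon\langle \hat V, \hat\rho - S\rangle \le (P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S).$$

Also, for the quadratic loss,

$$P(\ell \bullet f) - P(\ell \bullet f_\rho) = \|f - f_\rho\|_{L_2(\Pi)}^2.$$

Therefore,

$$\|f_{\hat\rho} - f_\rho\|_{L_2(\Pi)}^2 + \|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + \varepsilon\langle \log\hat\rho, \hat\rho - S\rangle + \bar\varepsilon\langle \hat V, \hat\rho - S\rangle \le \|f_S - f_\rho\|_{L_2(\Pi)}^2 + (P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S).$$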

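Recall that we have set $S = (1-\delta)S' + \delta\frac{I_m}{m}$, where $S' \in \mathcal{S}_m$, $\mathrm{rank}(S') = r$, $\delta \in (0,1)$. Clearly,

$$\langle \hat V, S - S'\rangle \le \|\hat V\|_\infty\|S - S'\|_1 \le \|S - S'\|_1 = \delta\Bigl\|S' - \frac{I_m}{m}\Bigr\|_1 \le 2\delta,$$

where we used the fact that $\|\hat V\|_\infty \le 1$ for $\hat V \in \partial\|\hat\rho\|_1$. This implies

$$\|f_{\hat\rho} - f_\rho\|_{L_2(\Pi)}^2 + \|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + \varepsilon\langle \log\hat\rho, \hat\rho - S\rangle + \bar\varepsilon\langle \hat V, \hat\rho - S'\rangle \le \|f_S - f_\rho\|_{L_2(\Pi)}^2 + 2\bar\varepsilon\delta + (P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S). \tag{41}$$

Recall formula (2) for the subdifferential of nuclear norm. Let $L = \mathrm{supp}(S')$. By the duality between the operator and nuclear norms, there exists $M \in \mathbb{H}_m$ with $\|M\|_\infty \le 1$ such that

$$\langle \mathcal{P}_L^\perp(M), \hat\rho - S'\rangle = \langle M, \mathcal{P}_L^\perp(\hat\rho - S')\rangle = \|\mathcal{P}_L^\perp(\hat\rho - S')\|_1 = \|\mathcal{P}_L^\perp(\hat\rho)\|_1.$$

With $V = \mathrm{sign}(S') + \mathcal{P}_L^\perp(M) \in \partial\|S'\|_1$, by monotonicity of subdifferential, we get that

$$\langle \mathrm{sign}(S'), \hat\rho - S'\rangle + \|\mathcal{P}_L^\perp(\hat\rho)\|_1 = \langle V, \hat\rho - S'\rangle \le \langle \hat V, \hat\rho - S'\rangle. \tag{42}$$

In addition to this, we have

$$\langle \log\hat\rho, \hat\rho - S\rangle = \langle \log\hat\rho - \log S, \hat\rho - S\rangle + \langle \log S, \hat\rho - S\rangle = K(\hat\rho; S) + \langle \log S, \hat\rho - S\rangle. \tag{43}$$

Substituting (42) and (43) into (41), we get

$$\|f_{\hat\rho} - f_\rho\|_{L_2(\Pi)}^2 + \|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + \varepsilon K(\hat\rho; S) + \bar\varepsilon\|\mathcal{P}_L^\perp(\hat\rho)\|_1 \le \|f_S - f_\rho\|_{L_2(\Pi)}^2 + \varepsilon\langle \log S, S - \hat\rho\rangle + \bar\varepsilon\langle \mathrm{sign}(S'), S' - \hat\rho\rangle + 2\bar\varepsilon\delta + (P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S). \tag{44}$$

The following bound on $\bar\varepsilon\langle \mathrm{sign}(S'), S' - \hat\rho\rangle$ is straightforward:

$$\bar\varepsilon\langle \mathrm{sign}(S'), S' - \hat\rho\rangle \le \bar\varepsilon\langle \mathrm{sign}(S'), S - \hat\rho\rangle + \bar\varepsilon\|\mathrm{sign}(S')\|_\infty\|S' - S\|_1 \le \bar\varepsilon\|\mathrm{sign}(S')\|_2\|S - \hat\rho\|_2 + 2\bar\varepsilon\delta = \bar\varepsilon\sqrt{r}\,m\|f_S - f_{\hat\rho}\|_{L_2(\Pi)} + 2\bar\varepsilon\delta \le rm^2\bar\varepsilon^2 + \frac{1}{4}\|f_S - f_{\hat\rho}\|_{L_2(\Pi)}^2 + 2\bar\varepsilon\delta. \tag{45}$$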

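A similar bound on $\varepsilon\langle \log S, S - \hat\rho\rangle$ is only slightly more complicated. Suppose $S'$ has the following spectral representation: $S' = \sum_{k=1}^r \lambda_k P_k$ with eigenvalues $\lambda_k \in (0,1]$ (repeated with their multiplicities) and one-dimensional orthogonal eigenprojectors $P_k$. We will extend $P_j$, $j = 1, \dots, r$ to the complete orthogonal resolution of the identity $P_j$, $j = 1, \dots, m$. Then

$$\log S = \log\Bigl((1-\delta)S' + \delta\frac{I_m}{m}\Bigr) = \sum_{j=1}^r \log\bigl((1-\delta)\lambda_j + \delta/m\bigr)P_j + \sum_{j=r+1}^m \log(\delta/m)P_j = \sum_{j=1}^r \log\bigl(1 + (1-\delta)m\lambda_j/\delta\bigr)P_j + \log(\delta/m)I_m$$

and

$$\langle \log S, S - \hat\rho\rangle = \Bigl\langle \sum_{j=1}^r \log\bigl(1 + (1-\delta)m\lambda_j/\delta\bigr)P_j, S - \hat\rho\Bigr\rangle + \log(\delta/m)\langle I_m, S - \hat\rho\rangle = \Bigl\langle \sum_{j=1}^r \log\bigl(1 + (1-\delta)m\lambda_j/\delta\bigr)P_j, S - \hat\rho\Bigr\rangle,$$

where we used the fact that $\langle I_m, S - \hat\rho\rangle = \mathrm{tr}(S) - \mathrm{tr}(\hat\rho) = 0$. Therefore,

$$\varepsilon\langle \log S, S - \hat\rho\rangle \le \varepsilon\Bigl\|\sum_{j=1}^r \log\bigl(1 + (1-\delta)m\lambda_j/\delta\bigr)P_j\Bigr\|_2\|S - \hat\rho\|_2 = \varepsilon m\Bigl(\sum_{j=1}^r \log^2\bigl(1 + (1-\delta)m\lambda_j/\delta\bigr)\Bigr)^{1/2}\|f_S - f_{\hat\rho}\|_{L_2(\Pi)} \le \varepsilon\sqrt{r}\,m\log(m/\delta)\|f_S - f_{\hat\rho}\|_{L_2(\Pi)} \le rm^2\varepsilon^2\log^2(m/\delta) + \frac{1}{4}\|f_S - f_{\hat\rho}\|_{L_2(\Pi)}^2, \tag{46}$$

where it was used that for $\lambda_j \in [0,1]$

$$\log\bigl(1 + (1-\delta)m\lambda_j/\delta\bigr) \le \log\Bigl(\frac{\delta + (1-\delta)m}{\delta}\Bigr) \le \log(m/\delta).$$

Substituting bounds (45) and (46) in (44) we easily get bound (38), as claimed in the lemma.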
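
The spectral formula for $\log S$ used in the derivation of (46) can be confirmed numerically on a small example. The sketch below is only an illustration: it works with real symmetric matrices for simplicity, and the function names are hypothetical.

```python
import numpy as np


def hermitian_log(A):
    """Matrix logarithm of a positive definite Hermitian matrix via eigh."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.conj().T


def log_mixture_formula(lams, Q, delta):
    """Closed form of log S for S = (1 - delta) S' + delta I/m, where
    S' = sum_j lams[j] q_j q_j^T and q_j is the j-th column of orthogonal Q:
    sum_j log(1 + (1 - delta) m lams[j] / delta) P_j + log(delta / m) I."""
    m = Q.shape[0]
    out = np.log(delta / m) * np.eye(m)
    for lam, q in zip(lams, Q.T):
        out += np.log(1.0 + (1.0 - delta) * m * lam / delta) * np.outer(q, q)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, delta = 5, 1e-2
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    lams = np.array([0.7, 0.3])                      # rank-2 spectrum of S'
    S_prime = sum(l * np.outer(q, q) for l, q in zip(lams, Q.T))
    S = (1 - delta) * S_prime + delta * np.eye(m) / m
    print(np.max(np.abs(hermitian_log(S) - log_mixture_formula(lams, Q, delta))))
```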


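We will also need the following simple lemma that provides a bound on $K(S'\|U)$ in terms of $K(S\|U)$.

Let

$$h(\delta) := \delta\log\frac{1}{\delta} + (1-\delta)\log\frac{1}{1-\delta}.$$

Observe that

$$h(\delta) = \delta\log\frac{1}{\delta} + (1-\delta)\log\Bigl(1 + \frac{\delta}{1-\delta}\Bigr) \le \delta\log\frac{1}{\delta} + \delta = \delta\log\frac{e}{\delta}$$

(this bound will be used in what follows).

Lemma 17 Let $\delta \in (0,1)$, $S' \in \mathcal{S}_m$ with $\mathrm{rank}(S') = r$ and $S = (1-\delta)S' + \delta\frac{I_m}{m}$. Then, for any $U \in \mathcal{S}_m$,

$$K(S'\|U) \le \frac{K(S\|U)}{1-\delta} + \frac{h(\delta)}{1-\delta}.$$

Proof The following identities are straightforward:

$$K(S\|U) = \mathrm{tr}(S(\log S - \log U))$$
$$= (1-\delta)\mathrm{tr}(S'(\log S - \log U)) + \delta\,\mathrm{tr}((I_m/m)(\log S - \log U))$$
$$= (1-\delta)\mathrm{tr}(S'(\log S' - \log U)) + (1-\delta)\mathrm{tr}(S'(\log S - \log S')) + \delta\,\mathrm{tr}((I_m/m)(\log S - \log(I_m/m))) + \delta\,\mathrm{tr}((I_m/m)(\log(I_m/m) - \log U))$$
$$= (1-\delta)K(S'\|U) - (1-\delta)K(S'\|S) + \delta K(I_m/m\|U) - \delta K(I_m/m\|S).$$

Since $K(I_m/m\|U) \ge 0$, it follows that

$$K(S'\|U) \le \frac{K(S\|U)}{1-\delta} + K(S'\|S) + \frac{\delta}{1-\delta}K(I_m/m\|S). \tag{47}$$

Assuming that $S'$ has spectral representation $S' = \sum_{j=1}^r \lambda_j P_j$ with eigenvalues $\lambda_j > 0$ and one-dimensional projectors $P_j$, we get

$$-K(S'\|S) = \sum_{j=1}^r \lambda_j\log\frac{(1-\delta)\lambda_j + \delta/m}{\lambda_j} = \sum_{j=1}^r \lambda_j\log\Bigl(1 - \delta + \frac{\delta}{m\lambda_j}\Bigr) \ge \sum_{j=1}^r \lambda_j\log(1-\delta) = \log(1-\delta),$$

implying that $K(S'\|S) \le \log\frac{1}{1-\delta}$. On the other hand, setting $\lambda_j := 0$ for $j > r$,

$$K(I_m/m\|S) = \frac{1}{m}\sum_{j=1}^m \log\frac{1/m}{(1-\delta)\lambda_j + \delta/m} \le \frac{1}{m}\sum_{j=1}^m \log\frac{1}{\delta} = \log\frac{1}{\delta}.$$

Substituting these bounds in (47) yields the result.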
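
The bound $K(S'\|U) \le (K(S\|U) + h(\delta))/(1-\delta)$, which is what the proof above yields, can be checked on randomly generated density matrices. The following is a minimal sketch under that reading of Lemma 17; the helper names and the way random density matrices are generated are assumptions made for illustration.

```python
import numpy as np


def xlogx_trace(A):
    """tr(A log A) with the convention 0 log 0 = 0."""
    w = np.linalg.eigvalsh(A)
    w = w[w > 1e-14]
    return float(np.sum(w * np.log(w)))


def kl(A, B):
    """K(A||B) = tr(A (log A - log B)) for density matrices, B positive definite."""
    w, V = np.linalg.eigh(B)
    log_B = (V * np.log(w)) @ V.conj().T
    return xlogx_trace(A) - float(np.real(np.trace(A @ log_B)))


def h(delta):
    """h(delta) = delta log(1/delta) + (1 - delta) log(1/(1 - delta))."""
    return delta * np.log(1 / delta) + (1 - delta) * np.log(1 / (1 - delta))


def random_density(m, rank, rng):
    """Random density matrix of the given rank (normalized complex Wishart)."""
    G = rng.normal(size=(m, rank)) + 1j * rng.normal(size=(m, rank))
    A = G @ G.conj().T
    return A / np.real(np.trace(A))


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m, delta = 6, 0.05
    S_prime, U = random_density(m, 2, rng), random_density(m, m, rng)
    S = (1 - delta) * S_prime + delta * np.eye(m) / m
    print(kl(S_prime, U), (kl(S, U) + h(delta)) / (1 - delta))
```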


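To complete the proof of Theorem 15, we need to control the empirical process $(P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S)$ in the right hand side of bound (38). Our approach is based on the following empirical processes bound that is a slight modification of Lemma 1 in Koltchinskii (2013b). As before, we assume that $S = (1-\delta)S' + \delta\frac{I_m}{m}$ with $S' \in \mathcal{S}_m$, $\mathrm{rank}(S') = r$. We will set $\delta := \frac{1}{m^2n^2}$.

Let $\Xi_\varepsilon := n^{-1}\sum_{j=1}^n \varepsilon_j X_j$, where $\varepsilon_j$ are i.i.d. Rademacher random variables (that is, $\varepsilon_j$ takes values $+1$ and $-1$ with probability 1/2 each) and $\{\varepsilon_j\}$, $\{X_j\}$ are independent.

Lemma 18 Given $\delta_1, \delta_2 > 0$, denote

$$\alpha_n(\delta_1, \delta_2) := \sup\Bigl\{|(P_n - P)(\ell' \bullet f_A)(f_A - f_S)| : A \in \mathcal{S}_m, \|f_A - f_S\|_{L_2(\Pi)} \le \delta_1, \|\mathcal{P}_L^\perp A\|_1 \le \delta_2\Bigr\}.$$

Let $0 < \delta_1^- < \delta_1^+$, $0 < \delta_2^- < \delta_2^+$. For $t \ge 1$, denote

$$\bar t := t + \log\bigl([\log_2(\delta_1^+/\delta_1^-)] + 2\bigr) + \log\bigl([\log_2(\delta_2^+/\delta_2^-)] + 2\bigr) + \log 3.$$

Then, with probability at least $1 - e^{-t}$, for all $\delta_1 \in [\delta_1^-, \delta_1^+]$, $\delta_2 \in [\delta_2^-, \delta_2^+]$,

$$\alpha_n(\delta_1, \delta_2) \le C_1U\,\mathbb{E}\|\Xi_\varepsilon\|_\infty\bigl(\sqrt{r}\,m\delta_1 + \delta_2 + \delta\bigr) + C_2U\delta_1\sqrt{\frac{\bar t}{n}} + C_3U^2\frac{\bar t}{n},$$

where $C_1, C_2, C_3 > 0$ are constants.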
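We will use this lemma to control the term $(P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S)$ in bound (38). Let $\delta_1 := \|f_{\hat\rho} - f_S\|_{L_2(\Pi)}$ and $\delta_2 := \|\mathcal{P}_L^\perp\hat\rho\|_1$. Define also

$$\delta_1^+ := \frac{2}{m}, \quad \delta_2^+ := 1, \quad \delta_1^- = \delta_2^- := \frac{1}{mn},$$

so that $\bar t \le t + 2\log(\log_2(mn) + 3) + \log 3$. It is easy to see that $\delta_1 \le \delta_1^+$ and $\delta_2 \le \delta_2^+$. If, in addition, $\delta_1 \ge \delta_1^-$, $\delta_2 \ge \delta_2^-$, the bound of Lemma 18 implies that with probability at least $1 - e^{-t}$

$$(P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S) \le \alpha_n(\delta_1, \delta_2) \le C_1U\,\mathbb{E}\|\Xi_\varepsilon\|_\infty\bigl(\sqrt{r}\,m\delta_1 + \delta_2 + \delta\bigr) + C_2U\delta_1\sqrt{\frac{\bar t}{n}} + C_3U^2\frac{\bar t}{n}.$$

If $\bar\varepsilon \ge C_1U\,\mathbb{E}\|\Xi_\varepsilon\|_\infty$, the last bound implies that

$$(P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S) \le \frac{1}{4}\|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + rm^2\bar\varepsilon^2 + \bar\varepsilon\|\mathcal{P}_L^\perp\hat\rho\|_1 + \bar\varepsilon\delta + \frac{1}{4}\|f_{\hat\rho} - f_S\|_{L_2(\Pi)}^2 + (C_2^2 + C_3)U^2\frac{\bar t}{n}. \tag{48}$$
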


Substituting this bound in the right hand side of (40), we get 

ll/p ~ /plli2(n) 

< 11 / 5 ' - /plli 2 (n) + log^ (m/5) + 2rm‘^e'^ (49) 

+5M + C[/ 2 i + 

where C := Cf + C 3 . 

In the case when 5i = ||/p - / 5 ||L 2 (n) < ^ or 52 = UPf/||i < 5^ = ^, we can 

replace the terms jll/p —/ 5 '||| 2 (n) /Ill ia bound (48) by their respective upper bounds 

~ ^2 ~ which would be smaller than CU'^^ for large enough C > 0, 

so bound (49) still holds (recall that U > Note also that ^ 

Thus, increasing the value of constant C, one can rewrite (49) in a simpler form as 


\\fp-fp\\l,pa)+^K{p-S) 

< 11 / 5 ' - /plli 2 (n) + log^ (m/5) + 2rm?e^ (50) 

+5e5 + C[/ 2 i 


The following expectation bound is a consequence of a matrix version of Bernstein inequality for $\|\Xi_\varepsilon\|_\infty$ (it follows by integrating out its exponential tails):

$$\mathbb{E}\|\Xi_\varepsilon\|_\infty \le 4\Bigl[\sqrt{\frac{\log(2m)}{nm}} \vee U\frac{\log(2m)}{n}\Bigr]$$

(it is also used in this computation that, in the case of uniform sampling from an orthonormal basis, $\sigma_X^2 = \|\mathbb{E}X^2\|_\infty = m^{-1}$, a simple fact often used in the literature; see, e.g., Koltchinskii 2011a, Section 5). Let


e := D'U 


log( 2 m) 


nm 
for some constant $D'$. If $D'$ is sufficiently large and


^^ log(2m) 

n 


< 


/log( 2 m) 


nm 


(51) 


then the condition e > Cit/EHH^Hoo is satisfied and bound (50) holds with probability at 
least 1 — e“*. Moreover, s5 <£,! 6 <£>/ implying that the term 5s5 in (50) can be 

dropped at a price of further increasing the value of constant C. 

If (51) does not hold, we still have that 


ll/p /pIlLin) 


n 


Recalling that t < t + 2log(log 2 (mn) + 3) and \og{m/5) < log(mn), we deduce from (50) 
that with some constant C and with probability at least 1 — 


ll/p - /pIlLcn) ^ Wfs' - /plli2(n) + rmVlog2(mn) 

^ jj2 rmlog(2m) jj2 t+log(log 2 (mn)+3) 


(52) 


Note that, for n > 2, 

log(log 2 (mn) + 3) = log(log 2 ( 4 m) + log 2 ( 2 n)) < loglog 2 ( 4 m) + loglog 2 ( 2 n), (53) 

since log 2 ( 4 m) + log 2 ( 2 n) < log 2 ( 4 m) log 2 ( 2 n). Since also, for r > 1, 

^^ 2 ^ + log log 2 ( 4 m) ^ ^^ 2 ^"^ log( 2 m) 

we can replace in bound (52) the term *+i°g(i°g 2 (™'^)+ 3 ) term [/ 2 £±l£li£l 2 (^ 

(increasing the value of the constant C accordingly). This yields bound (34) of the theorem. 
For S' = p, it yields bound (35), and, moreover, for S' = p and 5 = (1 — 5)p + with 
^ , bound (50) also implies that 

eK{p; S) < rank(/ 9 )m^e^ \o^{m/5) + 2rank(/))m^e^ (55) 

+5e<5 + C[/ 2 |. 


We will now take 


e := D' 


'^^J log{2m) y ^2MM 


V nm ’ n 

for a large enough constant D' so that e > CiI/EHHeHoo- Assume that 


1 


e := 


log(mn) 


'^^J \og{2m) y^^2 ^og{2m) 


nm 


n 
As before, the term $\bar\varepsilon\delta$ in bound (55) will be absorbed by the term $CU^2\frac{\bar t}{n}$ with a larger value of $C$ and also

rank(/9)m e log (m/5) rank(/))m e U - 1 y U - 

n V ’ n 

As a result, taking into account (53), (54), bound (55) can be rewritten as follows; 


eK{P;S) < CU^ 


rank(p)m log(2m) [ i \ / Ty- 2 log(2m) 

n I V n 

I+log loga (2n) 


(56) 


+ 


Using the bound of Lemma 17 along with the bound 

, .jx , . /jx 1 1 / 2 2 \ ^ rr /m (t + loglog 2 (2n)) log(mn) 

h{5) < 5\og{e/5) = -^\og{em^n^) < 

V n ylog(2m) 

we easily get that (37) holds. 


3.2 Oracle Inequalities for Trace Regression with Gaussian Noise 

In this subsection, we establish oracle inequalities for the von Neumann entropy penalized least squares estimator in the case of trace regression model with Gaussian noise (Assumption 4). Unlike in the case of Theorem 15 of the previous section, our aim is not to obtain sharp oracle inequality, but rather to get a clean main term of the random error bound part of the inequality, namely, the term $\sigma_\xi^2\frac{\mathrm{rank}(S)m(t + \log(2m))}{n}$ in inequality (58) below.

Note that this term depends only on the variance of the noise $\sigma_\xi^2$, but not on the constant $U$ from Assumption 1 (the constant $U$ is involved only in the higher order $O(n^{-2})$ terms of the bound). Note also that there are no constraints on the variance $\sigma_\xi^2$ that could be arbitrarily small, or even equal to 0 (in which case only higher order terms are present in the bound). This improvement comes at a price of having the leading constant 2 in the oracle inequality and also of imposing assumption (57) that requires the regularization parameter $\varepsilon$ to be bounded away from 0 (again, unlike Theorem 15, where it could be arbitrarily small). As in the previous section, we also obtain a bound on Kullback-Leibler divergence $K(\rho\|\hat\rho^\varepsilon)$.


Theorem 19 Let t>\. Suppose 


s G 


DU' 


21 + log^ mlog^ n Dia^ /t-|-log(2m) , , 2 ^ + ^ 


n ’ log (mn) 


nm 


y Du'^ 


n 


(57) 


with large enough constants D, Di > 0. There exists a constant C > 0 such that with 
probability at least 1 — 


llfr - fPlm < jipL[211/s - fA\lm + c(p 


2rank(5)m(t + log(2m)) 


n 


2,,2rank(5)m^(t + log(2m))^ log(2m) 4rank(5)m^(t + log^ mlog^ n)^ log^(mn) 

+ 0-^0 -^-h U -^- 






(58) 
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In particular, 


ll/p® /plli2(n) — ^ 


a: 


2 rank(p)m(t+log(2m)) 


(59) 


I 27 'r 2 i’ank(p)m^(t+log( 2 m))^ log( 2 m) . t 7-4 rank(p)m^ (t+log® m log^ n)^ log^ (mn) 
-\-(7 c U -H- + U - ^5 - 


Moreover, if 


e := 


Dia^ /1 + log(2m) \ / ^,..2 ^ + log^ ^ log^ ^ 


log(mn) V nm 


y Du^ 


n 


for large enough constants D,Di, then with some constant C and with the same probability 
both (59) and the following bound hold: 


Kipim < c 




rank(p)m®/^ (t+log(2m log( ran) 


(60) 


+cri 


2 rank(p)m^ (t+log(2m)) log(2m) _|_ jj2 rank(p)m^ (t+log® m log^ n) log^ (mn) 


Proof As in the proof of Theorem 15, we rely on Lemma 16, but we use a different approach to bounding the empirical process $(P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S)$. The following identity follows from the definition of quadratic loss $\ell$:

$$(\ell' \bullet f)(x,y)(f(x) - f_S(x)) = 2(f(x) - f_S(x))^2 + 2(f_S(x) - y)(f(x) - f_S(x))$$

and it implies that

$$(P - P_n)(\ell' \bullet f_{\hat\rho})(f_{\hat\rho} - f_S) = -2(P_n - P)(f_{\hat\rho} - f_S)^2 - 2\langle \Xi, \hat\rho - S\rangle, \tag{61}$$

where

$$\Xi := n^{-1}\sum_{j=1}^n (f_S(X_j) - Y_j)X_j - \mathbb{E}(f_S(X) - Y)X.$$

We will bound $(P_n - P)(f_{\hat\rho} - f_S)^2$ in representation (61) as follows:

$$(P_n - P)(f_{\hat\rho} - f_S)^2 \le \|\hat\rho - S\|_1^2\,\beta_n\Bigl(\frac{\|f_{\hat\rho} - f_S\|_{L_2(\Pi)}}{\|\hat\rho - S\|_1}\Bigr), \tag{62}$$

where

$$\beta_n(\Delta) := \sup\Bigl\{|(P_n - P)(f_A^2)| : A \in \mathbb{H}_m, \|A\|_1 \le 1, \|f_A\|_{L_2(\Pi)} \le \Delta\Bigr\}.$$

The next lemma provides a bound on $\beta_n(\Delta)$. Its proof is somewhat involved and it will not be given here. It is based on Rudelson’s $L_\infty(P_n)$ generic chaining bound for empirical processes indexed by squares of functions and on the ideas of the paper by Guédon et al. (2008) combined with Talagrand’s concentration inequality (see also Aubrun 2009, Liu 2011 and Theorem 3.16, Lemma 9.8 and Proposition 9.2 in Koltchinskii 2011b for similar arguments).

Lemma 20 Given $0 < \delta^- < \delta^+$ and $t \ge 1$, let

$$\bar t := t + \log\bigl(\log_2(\delta^+/\delta^-) + 3\bigr).$$

Then, with some constant $C$ and with probability at least $1 - e^{-t}$ the following bound holds for all $\Delta \in [\delta^-, \delta^+]$:

$$\beta_n(\Delta) \le C\Bigl[\Delta U\frac{\log^{3/2}m\,\log n}{\sqrt{n}} + U^2\frac{\log^3 m\,\log^2 n}{n} + \Delta U\sqrt{\frac{\bar t}{n}} + U^2\frac{\bar t}{n}\Bigr]. \tag{63}$$



We will use Lemma 20 to control /3n{A) for A ;= . Let (5+ := ^ and 

•= ih.- choice, f < t + log(log 2 n + 3). Note that for A = ll/AlUain) = 

= J’*'. If also ||/A||L 2 (n) A d~, then we can substitute bound (63) on 
ldn{A) into (62) that yields; 


{Pn - P){fp - fsf 


< C 


\\fi> - fshmiWp - 


+||p - + II/, _ /s||,,|n,||p - s\\,ufl 

+\\p-sfiU'‘l 

< ill* - AliLm + 8(C" + C/8)C/2!2 £i!^||p - Sll? 
+ill* - AllL(n) + 8(C" + C/8)f/U||/ _ S||2 

< ill* - *IIL,n, + - S||f, 


(64) 


where C := 8(C'2 + (7/8). If, on the other hand, ||/A||L 2 (n) then \\fp-fs\\L 2 iu) 

in the above bound can be replaced by —«S'||i and the proof that follows only simplifies 

since 


1 

Te 


ll/p — /s|li2(n) 


1 1 

16 m?n^ 


||)5-5||?<lc/2 


log^ m log^ n + t 
n 


11/5-^11 


2 

1 - 


Another term in the right hand side of representation (61) to be controlled is (H, p — S). 
Note that H = Si + S 2 , where 

n 

‘^1 

i=i 

and 

n 

H2 := n-i - HfsiX) - fpiX))X. 

i=i 

Recall that $S = (1-\delta)S' + \delta\frac{I_m}{m}$ with $S' \in \mathcal{S}_m$, $\mathrm{rank}(S') = r$, $\mathrm{supp}(S') = L$ and $\delta = \frac{1}{m^2n^2}$.
The term with $\Xi_1$ is controlled as follows:

i^uP-S) 

< {VLi^i),p- 5')| + - 5'))| + \{Vj;iE,),S' - S) 

< \\Vl{^i)\\2\\p - 5'||2 + + P^(Hi) ||5' - 5||i 

CXD 

< 2\/^m||Hi||oo||/p - fs\\L2{u) + Pi||oo||7^l (^)lli +4(5||Hi||oo 

< 32rm2||Hi||^ + ^\\fp - /s||i,(n) 

+ ||Si||oo||'Pl (^)lli + 4(5||Hi||oo. 

We also have 

{E2,p-S)\ < ||H2||oo||/5-5||i < ||H2||oo||/5-5'||i + ||H2||oo||5'-5||i 

< ||S 2 ||oo||^ — 'S"||i + 2 ( 5 ||H 2 ||oo- 

Thus, 

{E,p-S) < 32rm2||Hi||^ + ^\\fp - fs\\l^(^u) 

+ II‘='i||oo||’Pl (p)||i + 4(5||r,i Iloo + ||‘=‘2||oo||^ ~ 'S"||i + 25||;=,2 ||oo- 
It follows from (61), (64) and (67) that with some constant C 


{p-p^){e.fp){f~p-fs)< 

ill/p - /5llL(n) + - 5||f 

+64rm2||Hi||^ + 2||Hi||oo||T’i (^)lli +8(5||Hi||oo 

+ 2 ||‘=‘ 2 ||oo||^ ~ •S'^lli + 4(I||;=;2lloo- 


(65) 


( 66 ) 


(67) 


( 68 ) 


This bound will be substituted in (38). Note that, if assumption (57) on e holds with a 
sufficiently large constant D, then we have 

e > + ^ 

“ n 

(this follows from the fact that t <t + log(log 2 n + 3) < t + c log^ m log^ n for some constant 
c > 0). Assume also that e > 4||Hi||oo and recall that K{p]S) > 4||p — 5||i (see inequality 
8). Taking all this into account, (38) implies that 

ll/p - /pllL(n) + ill/p - + IKiP; S) + f ||P^/i||i 

< ll/s' “ /plli 2 (n) + log^(m/(5) + 5rm?s'^ + Qe5 (69) 

+2||;=.2||oo||p ~ 'S'^lli + 4||.=.2 ||oo<5. 

It remains to control $\|\Xi_1\|_\infty$ and $\|\Xi_2\|_\infty$. To this end, we use matrix versions of Bernstein inequality. To bound $\|\Xi_2\|_\infty$, we use its standard version which yields that with probability at least $1 - e^{-t}$

IIH 2 II 00 < 2[ E{fs{X) - fp{X)fX^ 

V ifsix) - f,ixmx\\^ t+io^] 

-^00 

where || • denotes the essential supremum norm in the space of random variables. Since 

E{fs{X) - U{X)fX^ < U^fs - /p||i,(n) 

and 

{fs{X)-fp{X))\\X\\^ <2U\ 

J^oo 

we get 

II= 2 |U < i\\\fs - Alli,(n)Cv^±!^ + c;2£±i^], (70) 

This implies that 

2||H2||oo||^ - 5'||i < 11/5 - /plli^(n) + 16U^i±^^\\p - STi (71) 

+8C/2^±i2^||p-5'||i. 

Note that 

16^2 t+log(2m) ||^_g7||2 

< 16[/2 *+l°f H \\p _ 5||2 + ig^2 t+M2n7) ^ ^2) (72) 

and 

8[/21±1h^||p-5'||i 

< + 8^2 t+log2m) ||p^(^ _ (73^ 

< 8C/2^±i2|(H!!d + g^ 2 t+log 2 m) ||p^(^ _ ^ ig ^2 I+log(2m) 

Since, for some constant C > 0, 

g^2 H-loi(2m) ||p^(^ - S)||l < ^\\Vl(P - S)|b 

< < ■ ||;. _ + p„^,i r„.2(,+to,(2„))7 

it follows from (71), (72) and (73) that 

2||‘=‘2||oo||p ~ <S^||i < ll/s ~ /plli2(n) 

+ 16[/2l±iH|(^||p_5||2 + 16[/2t±M2^(4^ +J2) (74) 

+8f/2 *+l°g""^) ||pfp||l + 16[/2 W°g2m) ^ 

+ill* - AIIL,n) + c"c/- '-"-’<‘ 7 ' 7 P"))t 
Note that (70) also implies that


|oo ^ 


m 


t+log(2m) 

n 


+ 


jj2 t+log(2m) 


(75) 


(since II /5 — /pH^jin) <'ni ^||5' — p \\2 < 2m ^). Let us substitute (74) and (75) in the last 
line of (69). Assume that 

2 t+ log( 2 m) 


e > 16t/" 


n 


and that constant D in assumption (57) is large enough so that 

n 4 

(recall inequality 8 ). It easily follows that with some constants ( 71 ,( 72 , 

ll/p “ /plli 2 (n) + 

< 2||/s - /plli 2 (n) + \og^{m/5) + Srm^e^ 


(76) 


(note that the term (t+iog( 2 m)) gf jg “absorbed” by the term Cirm^e^ \o^{m/5) 

of bound (76) provided that constant (7i is large enough). Since 


5 = 


m?n^ 


< u- 


,t + log( 2 m) 


< e 


n 


(recall that U'^ > m ^), we have ed < e^. Also, since U > m 


U /f + log(2m) 


5 = U\ 


11 + log( 2 m) 1 + log( 2 m) 


m 


n 


n 'm?'n? 


n 


<e\ 


Therefore, (76) implies that with some constant C 

\\fp-fp\\Uu) + mp-^s) 

< 211/5 - /plli 2 (n) +C{rm^e^ \og^{m/5) +rvn?£^^. 


(77) 


To bound ||Hi ||oo, we use a version of matrix Bernstein type inequality due to Koltchinskii 
(2011b) (see bound (2.7) of Theorem 2.7). Its version for a = 2 (with x Ua^) implies 
that for some constant K > 0 with probability at least 1 — 


< K 


a^] 


It + log( 2 m) 


nm 


\/<’(U 


{t + log( 2 m)) log^/^( 2 t/m^/^) 


n 


( 78 ) 


We choose 


e := D2 


'^^J t + log{2m) ^ (t+j^( 2 m)) log^/ 2 ( 2 m) 


nm 


n 
with a sufficiently large constant D 2 to satisfy the condition ||Si||oo < 4e with probability
at least 1 — (the rest of the assumptions we made on e are also satisfied with this choice). 

Bound (77) then implies that with some constant C and with probability at least 1 —3e“* 
the following inequality holds: 


W - 

+ C 

+ u^ 


■ /pIlLfn) ^ 211/5 - /p||i2(n) 

2 rm{t+ log{2m)) + log(2"^))^ log(2m) 

(Jc - \- cr^U -- 

, n ^ 

rm?{t + log^ m log^ n)^ log^(mn)1 


(79) 


Using bound (39) to replace S in \\fs — /p|li 2 (n) S' and adjusting the value of constant 
$C$ to rewrite the probability bound as $1 - e^{-t}$, it is easy to complete the proof of (58).
If $S = \rho$, this also yields bound (59). Moreover, with a larger value of regularization
parameter 


e ;= 


log(mn) 


1 1 + log(2m) y log log^ ^ 


nm 


n 


bound (77) and Lemma 17 easily imply bound (60). 


3.3 Optimality Properties of von Neumann Entropy Penalized Estimator $\tilde\rho^{\varepsilon}$

We start with upper bounds on the error of the estimator $\tilde\rho^{\varepsilon}$ (the von Neumann entropy penalized
least squares estimator defined by (7)) in Hellinger, Kullback-Leibler and Schatten $q$-norm
distances for $q \in [1,2]$ for the trace regression model with Gaussian noise (Assumption
4). To avoid the impact of “second order terms” on the upper bounds, we will make the
following simplifying assumptions:


U\ — log m < 1 and 
V n 


log^'^^ m log^ n log(mn) < 


(80) 


Recall that, for the Pauli basis, U = m so, the above assumptions hold if ra > log^m 
and (Tg is larger than (times a logarithmic factor). We will choose regularization 

parameter e as follows: 

( 81 ) 

log(mn) V nm 

with a sufficiently large constant $D_1 > 0$. The next result shows that the minimax rates of
Theorem 4 are attained up to logarithmic factors for the estimator $\tilde\rho^{\varepsilon}$.


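Theorem 21 There exists a constant $C > 0$ such that the following bounds hold for all
$r = 1,\dots,m$, for all $\rho \in \mathcal{S}_{r,m}$ and for all $q \in [1,2]$ with probability at least $1 - m^{-2}$:

$$\|\tilde\rho^{\varepsilon}-\rho\|_q \le C\left(\frac{\sigma_\xi m^{3/2} r^{1/q}}{\sqrt{n}}\sqrt{\log m}\,\log^{(2-q)/q}(mn) \bigwedge \left(\frac{\sigma_\xi m^{3/2}}{\sqrt{n}}\right)^{1-\frac{1}{q}}(\log m)^{\frac{1}{2}-\frac{1}{2q}} \bigwedge 2\right), \tag{82}$$

$$H^2(\tilde\rho^{\varepsilon},\rho) \le C\,\frac{\sigma_\xi m^{3/2} r}{\sqrt{n}}\sqrt{\log m}\,\log(mn) \bigwedge 2 \tag{83}$$

and

$$K(\rho\|\tilde\rho^{\varepsilon}) \le C\,\frac{\sigma_\xi m^{3/2} r}{\sqrt{n}}\sqrt{\log m}\,\log(mn). \tag{84}$$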


Proof We will need the following simple lemma. 


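Lemma 22 For all $\rho \in \mathcal{S}_m$ and all $l = 1,\dots,m$, there exists $\rho' \in \mathcal{S}_{l,m}$ such that

$$\|\rho-\rho'\|_2^2 \le \frac{1}{l}.$$

Proof Suppose that $\rho = \sum_{j=1}^m \lambda_j P_j$, where $\lambda_j$ are the eigenvalues of $\rho$ repeated with
their multiplicities and $P_j$ are orthogonal one-dimensional projectors. Note that $\{\lambda_j : j = 1,\dots,m\}$
is a probability distribution on the set $\{1,\dots,m\}$. Let $\nu$ be a random variable
sampled from this distribution and $\nu_1,\dots,\nu_l$ be its i.i.d. copies. Then $\mathbb{E}P_\nu = \rho$ and

$$\mathbb{E}\Big\|l^{-1}\sum_{j=1}^{l}P_{\nu_j}-\rho\Big\|_2^2 = \frac{\mathbb{E}\|P_\nu-\rho\|_2^2}{l} = \frac{\mathbb{E}\|P_\nu\|_2^2-\|\rho\|_2^2}{l} = \frac{1-\|\rho\|_2^2}{l} \le \frac{1}{l}.$$

Therefore, there exists a realization $\nu_1 = k_1,\dots,\nu_l = k_l$ of r.v. $\nu_1,\dots,\nu_l$ such that

$$\Big\|l^{-1}\sum_{j=1}^{l}P_{k_j}-\rho\Big\|_2^2 \le \frac{1}{l}.$$

Denote $\rho' := l^{-1}\sum_{j=1}^{l}P_{k_j}$. Then, $\rho' \in \mathcal{S}_{l,m}$ and $\|\rho-\rho'\|_2^2 \le \frac{1}{l}$.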
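The sampling argument in the proof of Lemma 22 can be checked numerically. The following is a minimal sketch, not part of the original paper: the helper names `random_density_matrix` and `sampled_low_rank_approximation` are hypothetical, and the random test state is an arbitrary choice. It draws $l$ eigenprojectors i.i.d. from the eigenvalue distribution, averages them, and compares the empirical mean squared Hilbert-Schmidt error with the value $(1-\|\rho\|_2^2)/l$ from the proof.

```python
import numpy as np


def random_density_matrix(m, seed=0):
    """Hypothetical helper: a random m x m density matrix (PSD, unit trace)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real


def sampled_low_rank_approximation(rho, l, n_trials=2000, seed=0):
    """Sample l eigenprojectors of rho i.i.d. with probabilities given by its
    eigenvalues and average them; repeat n_trials times.

    Returns (best candidate, its squared Hilbert-Schmidt error,
             empirical mean of the squared error over all trials).
    """
    rng = np.random.default_rng(seed)
    lam, vecs = np.linalg.eigh(rho)
    lam = np.clip(lam, 0.0, None)
    lam = lam / lam.sum()          # eigenvalues as a probability distribution
    m = rho.shape[0]

    best, best_err, total = None, np.inf, 0.0
    for _ in range(n_trials):
        idx = rng.choice(m, size=l, p=lam)   # nu_1, ..., nu_l
        v = vecs[:, idx]                     # selected eigenvectors as columns
        cand = (v @ v.conj().T) / l          # l^{-1} * sum of rank-one projectors
        err = np.linalg.norm(cand - rho, "fro") ** 2
        total += err
        if err < best_err:
            best, best_err = cand, err
    return best, best_err, total / n_trials


rho = random_density_matrix(8, seed=1)
l = 4
rho_prime, best_err, mean_err = sampled_low_rank_approximation(rho, l)
expected = (1.0 - np.linalg.norm(rho, "fro") ** 2) / l
print(f"best error {best_err:.4f}, bound 1/l = {1 / l:.4f}")
print(f"mean error {mean_err:.4f}, (1 - ||rho||_2^2)/l = {expected:.4f}")
```

Since the mean error is below $1/l$, at least one sampled realization must satisfy the bound of the lemma, which is exactly the existence step in the proof; the best candidate has rank at most $l$ and unit trace by construction.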


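First, we will prove bound (82) for $q = 2$. To this end, we use oracle inequality (58) with
$t = 2\log m + \log 2$ and with oracle $S = \rho' \in \mathcal{S}_{l,m}$ such that $\|\rho-\rho'\|_2^2 \le \frac{1}{l}$. Under simplifying
assumptions (80) it yields that with probability at least $1 - \frac{1}{2}m^{-2}$

$$\|\tilde\rho^{\varepsilon}-\rho\|_2^2 = m^2\|f_{\tilde\rho^{\varepsilon}}-f_\rho\|_{L_2(\Pi)}^2 \lesssim \frac{1}{l} + \tau^2 l\log m,$$

where $\tau := \frac{\sigma_\xi m^{3/2}}{\sqrt{n}}$. On the other hand, using the same inequality with $S = \rho \in \mathcal{S}_{r,m}$ yields
the bound

$$\|\tilde\rho^{\varepsilon}-\rho\|_2^2 \lesssim \tau^2 r\log m$$

that also holds with probability at least $1 - \frac{1}{2}m^{-2}$. Therefore, with probability at least
$1 - m^{-2}$,

$$\|\tilde\rho^{\varepsilon}-\rho\|_2^2 \lesssim \Big(\frac{1}{l} + \tau^2 l\log m\Big) \bigwedge \tau^2 r\log m. \tag{85}$$

Let $\bar l := \frac{1}{\tau\sqrt{\log m}}$. If $\bar l \in [1,m]$, set $l := [\bar l]$. Otherwise, if $\bar l > m$, set $l := m$ and, if $\bar l < 1$, set
$l := 1$. An easy computation shows that with such a choice of $l$ bound (85) implies (82) for
$q = 2$.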
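The computation behind this last step can be sketched as follows. This is a reconstruction under stated assumptions rather than the authors' own text: $\Delta$ denotes the squared Hilbert-Schmidt error bounded in (85), $\bar l := 1/(\tau\sqrt{\log m})$ is our notation for the unrounded balancing value, $[\cdot]$ is read as the integer part, bound (85) is read as $\Delta \lesssim (1/l + \tau^2 l \log m) \wedge \tau^2 r \log m$, and the trivial bound $\Delta \le 2$ for a difference of two density matrices is used.

```latex
% Sketch only; the symbols \Delta and \bar l are our own notation (see lead-in).
\begin{align*}
\text{Case } 1 \le \bar l \le m:\quad
  & l = [\bar l] \in [\bar l/2,\ \bar l]
    \ \Longrightarrow\
    \frac{1}{l} + \tau^2 l \log m
    \le \frac{2}{\bar l} + \tau^2 \bar l \log m
    = 3\,\tau\sqrt{\log m}. \\
\text{Case } \bar l > m:\quad
  & \tau\sqrt{\log m} < \frac{1}{m}
    \ \Longrightarrow\
    \tau^2 r \log m \le \tau^2 m \log m < \tau\sqrt{\log m}. \\
\text{Case } \bar l < 1:\quad
  & \tau\sqrt{\log m} > 1
    \ \Longrightarrow\
    \Delta \le 2 < 2\,\tau\sqrt{\log m}.
\end{align*}
% Combining the three cases with the second branch of (85) and \Delta \le 2:
\[
  \Delta \lesssim \tau^2 r \log m \ \wedge\ \tau\sqrt{\log m} \ \wedge\ 2,
  \qquad\text{hence}\qquad
  \sqrt{\Delta} \lesssim \tau\sqrt{r}\,\sqrt{\log m}
    \ \wedge\ \tau^{1/2}(\log m)^{1/4} \ \wedge\ 2 .
\]
```

The final display has the shape of a Schatten 2-norm bound with rank factor $r^{1/2}$, exponent $1 - 1/q = 1/2$ on $\tau$, and exponent $1/4$ on $\log m$.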
Next we use bound (60) that, for $t = 2\log m$, implies under assumptions (80) that with
some constant $C$ and with probability at least $1 - m^{-2}$

$$K(\rho\|\tilde\rho^{\varepsilon}) \le C\,\frac{\sigma_\xi m^{3/2} r}{\sqrt{n}}\sqrt{\log m}\,\log(mn), \tag{86}$$

which is bound (84). Bound (83) also holds in view of inequality (8).

Now, we prove bound (82) for $q = 1$ (the bound for $q \in [1,2]$ will then follow by
interpolation). To this end, we will use the following lemma (see Proposition 1 in Koltchinskii
2011a) that shows that if two density matrices are close in Hellinger distance and one of
them is “concentrated around a subspace” $L$, then the other one is also “concentrated around”
$L$.

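Lemma 23 For any $L \subset \mathbb{C}^m$ and all $S_1, S_2 \in \mathcal{S}_m$,

$$\|\mathcal{P}_L^{\perp}S_1\|_1 \le 2\|\mathcal{P}_L^{\perp}S_2\|_1 + 2H^2(S_1,S_2).$$

We apply this lemma to $S_1 = \tilde\rho^{\varepsilon}$, $S_2 = \rho$ and $L = \mathrm{supp}(\rho)$ so that $\mathcal{P}_L^{\perp}\rho = 0$. It yields that

$$\|\mathcal{P}_L^{\perp}\tilde\rho^{\varepsilon}\|_1 \le 2H^2(\tilde\rho^{\varepsilon},\rho).$$

Therefore,

$$\|\tilde\rho^{\varepsilon}-\rho\|_1 \le \|\mathcal{P}_L(\tilde\rho^{\varepsilon}-\rho)\|_1 + \|\mathcal{P}_L^{\perp}(\tilde\rho^{\varepsilon}-\rho)\|_1 \le \sqrt{2r}\,\|\tilde\rho^{\varepsilon}-\rho\|_2 + \|\mathcal{P}_L^{\perp}\tilde\rho^{\varepsilon}\|_1 \le \sqrt{2r}\,\|\tilde\rho^{\varepsilon}-\rho\|_2 + 2H^2(\tilde\rho^{\varepsilon},\rho). \tag{87}$$

Using bounds (82) for $q = 2$ and (83), we get from (87) that

$$\|\tilde\rho^{\varepsilon}-\rho\|_1 \le C\,\frac{\sigma_\xi m^{3/2} r}{\sqrt{n}}\sqrt{\log m}\,\log(mn) \bigwedge 2, \tag{88}$$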

which is equivalent to (82) for $q = 1$. Note that by choosing $t = 2\log m + \log 2 + 2$ (which
might have an impact only on the constant), we could make the probability bounds in (82) for
$q = 2$ and (83) to be at least $1 - \frac{1}{2}m^{-2}$, implying that (88) holds with probability at least
$1 - m^{-2}$, as it is claimed in the theorem.

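To complete the proof, it is enough to use the interpolation inequality of Lemma 1. It
follows that, for $q \in (1,2)$,

$$\|\tilde\rho^{\varepsilon}-\rho\|_q \le \|\tilde\rho^{\varepsilon}-\rho\|_1^{\frac{2}{q}-1}\,\|\tilde\rho^{\varepsilon}-\rho\|_2^{2-\frac{2}{q}}.$$

Substituting bound (82) for $q = 1$ and $q = 2$ into the last inequality yields the result for an
arbitrary $q \in (1,2)$. ■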
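The interpolation step can be illustrated numerically. The sketch below is not from the paper: the function names are hypothetical and the test matrix is an arbitrary difference of two random density matrices. It checks the standard Schatten-norm interpolation inequality $\|A\|_q \le \|A\|_1^{2/q-1}\|A\|_2^{2-2/q}$ for several $q \in [1,2]$, where the exponent $\theta = 2/q - 1$ solves $1/q = \theta + (1-\theta)/2$.

```python
import numpy as np


def schatten_norm(a, q):
    """Schatten q-norm: the l_q norm of the singular values of a."""
    s = np.linalg.svd(a, compute_uv=False)
    return float((s ** q).sum() ** (1.0 / q))


def interpolation_sides(a, q):
    """Return (lhs, rhs) of ||a||_q <= ||a||_1^theta * ||a||_2^(1 - theta),
    with theta = 2/q - 1, for q in [1, 2]."""
    theta = 2.0 / q - 1.0
    lhs = schatten_norm(a, q)
    rhs = schatten_norm(a, 1) ** theta * schatten_norm(a, 2) ** (1.0 - theta)
    return lhs, rhs


def random_density_matrix(m, rng):
    """Hypothetical helper: a random m x m density matrix."""
    g = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real


rng = np.random.default_rng(0)
diff = random_density_matrix(6, rng) - random_density_matrix(6, rng)
for q in (1.0, 1.25, 1.5, 1.75, 2.0):
    lhs, rhs = interpolation_sides(diff, q)
    print(f"q = {q:4.2f}:  ||A||_q = {lhs:.6f}  <=  {rhs:.6f}")
```

At the endpoints $q = 1$ and $q = 2$ the two sides coincide, which is why bounds for these two values are enough to cover the whole range.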


Similarly, in the case of trace regression with bounded response (see Assumption 3), the
minimax rates of Theorem 7 are also attained for the estimator $\tilde\rho^{\varepsilon}$ (up to log factors). In
this case, assume that Assumption 3 holds with U = U and, in addition, let us make the 
following simplifying assumptions:

$$U\sqrt{\frac{m\log m}{n}} \le 1 \quad\text{and}\quad \log\log_2 n \le m\log m. \tag{89}$$

For the Pauli basis ($U = m^{-1/2}$), the first assumption holds if $n \ge \log m$. The second
assumption does hold unless $n$ is extremely large ($n > 2^{\exp\{m\log m\}}$). Under these assumptions,
we will use the following value of regularization parameter $\varepsilon$:

U /log(2m) 
log(mn) V nm 

The following version of Theorem 21 holds in the bounded regression case (with a similar 
proof). 


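Theorem 24 There exists a constant $C > 0$ such that the following bounds hold for all
$r = 1,\dots,m$, for all $\rho \in \mathcal{S}_{r,m}$ and for all $q \in [1,2]$ with probability at least $1 - m^{-2}$:

$$\|\tilde\rho^{\varepsilon}-\rho\|_q \le C\left(\frac{U m^{3/2} r^{1/q}}{\sqrt{n}}\sqrt{\log m}\,\log^{(2-q)/q}(mn) \bigwedge \left(\frac{U m^{3/2}}{\sqrt{n}}\right)^{1-\frac{1}{q}}(\log m)^{\frac{1}{2}-\frac{1}{2q}} \bigwedge 2\right), \tag{90}$$

$$H^2(\tilde\rho^{\varepsilon},\rho) \le C\,\frac{U m^{3/2} r}{\sqrt{n}}\sqrt{\log m}\,\log(mn) \bigwedge 2 \tag{91}$$

and

$$K(\rho\|\tilde\rho^{\varepsilon}) \le C\,\frac{U m^{3/2} r}{\sqrt{n}}\sqrt{\log m}\,\log(mn). \tag{92}$$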


Remark 25 In the case of Pauli basis, the minimax optimal rates (up to constants and
logarithmic factors) are: $\frac{m r^{1/q}}{\sqrt{n}} \bigwedge \big(\frac{m}{\sqrt{n}}\big)^{1-\frac{1}{q}} \bigwedge 2$ for Schatten $q$-norm distances for $q \in [1,2]$; $\frac{mr}{\sqrt{n}}$
for nuclear norm, squared Hellinger and Kullback-Leibler distances (provided that $mr \le \sqrt{n}$).